diff mbox series

[2/2] index-pack: prefetch missing REF_DELTA bases

Message ID 4fcaa4481b5fd2a76aa21263f997e00913db0e0f.1557868134.git.jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series Partial clone fix: handling received REF_DELTA | expand

Commit Message

Jonathan Tan May 14, 2019, 9:10 p.m. UTC
When fetching, the client sends "have" commit IDs indicating that the
server does not need to send any object referenced by those commits,
reducing network I/O. When the client is a partial clone, the client
still sends "have"s in this way, even if it does not have every object
referenced by a commit it sent as "have".

If a server omits such an object, it is fine: the client could lazily
fetch that object before this fetch, and it can still do so after.

The issue is when the server sends a thin pack containing an object that
is a REF_DELTA against such a missing object: index-pack fails to fix
the thin pack. When support for lazily fetching missing objects was
added in 8b4c0103a9 ("sha1_file: support lazily fetching missing
objects", 2017-12-08), support in index-pack was turned off in the
belief that it accesses the repo only to do hash collision checks.
However, this is not true: it also needs to access the repo to resolve
REF_DELTA bases.

Support for lazy fetching should still generally be turned off in
index-pack because it is used as part of the lazy fetching process
itself (if not, infinite loops may occur), but we do need to fetch the
REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that
those are REF_DELTA themselves, because we do not send "have" when
making such fetches.)

To resolve this, prefetch all missing REF_DELTA bases before attempting
to resolve them. This both ensures that all bases are attempted to be
fetched, and ensures that we make only one request per index-pack
invocation, and not one request per missing object.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/index-pack.c     | 26 +++++++++++++++--
 t/t5616-partial-clone.sh | 61 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 2 deletions(-)

Comments

Johannes Schindelin May 15, 2019, 8:46 a.m. UTC | #1
Hi Jonathan,

On Tue, 14 May 2019, Jonathan Tan wrote:

> When fetching, the client sends "have" commit IDs indicating that the
> server does not need to send any object referenced by those commits,
> reducing network I/O. When the client is a partial clone, the client
> still sends "have"s in this way, even if it does not have every object
> referenced by a commit it sent as "have".
>
> If a server omits such an object, it is fine: the client could lazily
> fetch that object before this fetch, and it can still do so after.
>
> The issue is when the server sends a thin pack containing an object that
> is a REF_DELTA against such a missing object: index-pack fails to fix
> the thin pack. When support for lazily fetching missing objects was
> added in 8b4c0103a9 ("sha1_file: support lazily fetching missing
> objects", 2017-12-08), support in index-pack was turned off in the
> belief that it accesses the repo only to do hash collision checks.
> However, this is not true: it also needs to access the repo to resolve
> REF_DELTA bases.
>
> Support for lazy fetching should still generally be turned off in
> index-pack because it is used as part of the lazy fetching process
> itself (if not, infinite loops may occur), but we do need to fetch the
> REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that
> those are REF_DELTA themselves, because we do not send "have" when
> making such fetches.)
>
> To resolve this, prefetch all missing REF_DELTA bases before attempting
> to resolve them. This both ensures that all bases are attempted to be
> fetched, and ensures that we make only one request per index-pack
> invocation, and not one request per missing object.

Hmm. I wonder whether this can lead to *really* undesirable behavior, e.g.
with deep delta chains. The client would possibly have to fetch the
REF_DELTA object, but that would also be delivered in a thin pack with
*another* REF_DELTA object, and the same over and over again, with plenty
of round trips that kill performance really well.

Wouldn't it make more sense to introduce a new term like `promised`
(instead of `have`)? Both client and server will have to know about this,
and it would be a new capability, of course, but that way the server could
know that it has to send the entire delta chain.

Of course, this would be quite a bit more involved than the current patch
:-(

Ciao,
Dscho

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/index-pack.c     | 26 +++++++++++++++--
>  t/t5616-partial-clone.sh | 61 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index ccf4eb7e9b..0d55f73b0b 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -14,6 +14,7 @@
>  #include "thread-utils.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "fetch-object.h"
>
>  static const char index_pack_usage[] =
>  "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
> @@ -1351,6 +1352,25 @@ static void fix_unresolved_deltas(struct hashfile *f)
>  		sorted_by_pos[i] = &ref_deltas[i];
>  	QSORT(sorted_by_pos, nr_ref_deltas, delta_pos_compare);
>
> +	if (repository_format_partial_clone) {
> +		/*
> +		 * Prefetch the delta bases.
> +		 */
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +		for (i = 0; i < nr_ref_deltas; i++) {
> +			struct ref_delta_entry *d = sorted_by_pos[i];
> +			if (!oid_object_info_extended(the_repository, &d->oid,
> +						      NULL,
> +						      OBJECT_INFO_FOR_PREFETCH))
> +				continue;
> +			oid_array_append(&to_fetch, &d->oid);
> +		}
> +		if (to_fetch.nr)
> +			fetch_objects(repository_format_partial_clone,
> +				      to_fetch.oid, to_fetch.nr);
> +		oid_array_clear(&to_fetch);
> +	}
> +
>  	for (i = 0; i < nr_ref_deltas; i++) {
>  		struct ref_delta_entry *d = sorted_by_pos[i];
>  		enum object_type type;
> @@ -1650,8 +1670,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	int report_end_of_input = 0;
>
>  	/*
> -	 * index-pack never needs to fetch missing objects, since it only
> -	 * accesses the repo to do hash collision checks
> +	 * index-pack never needs to fetch missing objects except when
> +	 * REF_DELTA bases are missing (which are explicitly handled). It only
> +	 * accesses the repo to do hash collision checks and to check which
> +	 * REF_DELTA bases need to be fetched.
>  	 */
>  	fetch_if_missing = 0;
>
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index 7cc0c71556..f1baf83502 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -339,4 +339,65 @@ test_expect_success 'when partial cloning, tolerate server not sending target of
>  	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
>  '
>
> +test_expect_success 'tolerate server sending REF_DELTA against missing promisor objects' '
> +	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
> +	rm -rf "$SERVER" repo &&
> +	test_create_repo "$SERVER" &&
> +	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
> +	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
> +
> +	# Create a commit with a blob to be used as a delta base.
> +	for i in $(test_seq 10)
> +	do
> +		echo "this is a line" >>"$SERVER/foo.txt"
> +	done &&
> +	git -C "$SERVER" add foo.txt &&
> +	git -C "$SERVER" commit -m bar &&
> +	git -C "$SERVER" rev-parse HEAD:foo.txt >deltabase &&
> +
> +	git -c protocol.version=2 clone --no-checkout \
> +		--filter=blob:none $HTTPD_URL/one_time_sed/server repo &&
> +
> +	# Sanity check to ensure that the client does not have that blob.
> +	git -C repo rev-list --objects --exclude-promisor-objects \
> +		-- $(cat deltabase) >objlist &&
> +	test_line_count = 0 objlist &&
> +
> +	# Another commit. This commit will be fetched by the client.
> +	echo "abcdefghijklmnopqrstuvwxyz" >>"$SERVER/foo.txt" &&
> +	git -C "$SERVER" add foo.txt &&
> +	git -C "$SERVER" commit -m baz &&
> +
> +	# Pack a thin pack containing, among other things, HEAD:foo.txt
> +	# delta-ed against HEAD^:foo.txt.
> +	printf "%s\n--not\n%s\n" \
> +		$(git -C "$SERVER" rev-parse HEAD) \
> +		$(git -C "$SERVER" rev-parse HEAD^) |
> +		git -C "$SERVER" pack-objects --thin --stdout >thin.pack &&
> +
> +	# Ensure that the pack contains one delta against HEAD^:foo.txt. Since
> +	# the delta contains at least 26 novel characters, the size cannot be
> +	# contained in 4 bits, so the object header will take up 2 bytes. The
> +	# most significant nybble of the first byte is 0b1111 (0b1 to indicate
> +	# that the header continues, and 0b111 to indicate REF_DELTA), followed
> +	# by any 3 nybbles, then the OID of the delta base.
> +	git -C "$SERVER" rev-parse HEAD^:foo.txt >deltabase &&
> +	printf "f.,..%s" $(intersperse "," <deltabase) >want &&
> +	hex_unpack <thin.pack | intersperse "," >have &&
> +	grep $(cat want) have &&
> +
> +	replace_packfile thin.pack &&
> +
> +	# Use protocol v2 because the sed command looks for the "packfile"
> +	# section header.
> +	test_config -C "$SERVER" protocol.version 2 &&
> +
> +	# Fetch the thin pack and ensure that index-pack is able to handle the
> +	# REF_DELTA object with a missing promisor delta base.
> +	git -C repo -c protocol.version=2 fetch &&
> +
> +	# Ensure that the one-time-sed script was used.
> +	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
> +'
> +
>  test_done
> --
> 2.21.0.1020.gf2820cf01a-goog
>
>
Jonathan Tan May 15, 2019, 6:28 p.m. UTC | #2
> > To resolve this, prefetch all missing REF_DELTA bases before attempting
> > to resolve them. This both ensures that all bases are attempted to be
> > fetched, and ensures that we make only one request per index-pack
> > invocation, and not one request per missing object.
> 
> Hmm. I wonder whether this can lead to *really* undesirable behavior, e.g.
> with deep delta chains. The client would possibly have to fetch the
> REF_DELTA object, but that would also be delivered in a thin pack with
> *another* REF_DELTA object, and the same over and over again, with plenty
> of round trips that kill performance really well.

When the client fetches the REF_DELTA base, it won't be a REF_DELTA
object itself because Git makes these fetches without any "have" lines,
so the server doesn't know anything to delta against. Admittedly, this
is just due how to we implemented it - if later we find a way to
optimize the lazy fetches by adding "have", then we'll have to revisit
this.

Quoting from the commit message:

> > (When fetching REF_DELTA bases, it is unlikely that
> > those are REF_DELTA themselves, because we do not send "have" when
> > making such fetches.)

I tried to address this point with this sentence in the commit message.
If you think that this should be addressed more clearly in the commit
message, let me know if you have any suggestions.

> Wouldn't it make more sense to introduce a new term like `promised`
> (instead of `have`)? Both client and server will have to know about this,
> and it would be a new capability, of course, but that way the server could
> know that it has to send the entire delta chain.
> 
> Of course, this would be quite a bit more involved than the current patch
> :-(

I think this can also be solved by omitting "thin-pack". We might want
to do this once we optimize the lazy fetches by adding "have".

Thanks for taking a look at this.
Jeff King May 15, 2019, 11:16 p.m. UTC | #3
On Tue, May 14, 2019 at 02:10:55PM -0700, Jonathan Tan wrote:

> Support for lazy fetching should still generally be turned off in
> index-pack because it is used as part of the lazy fetching process
> itself (if not, infinite loops may occur), but we do need to fetch the
> REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that
> those are REF_DELTA themselves, because we do not send "have" when
> making such fetches.)

I agree that the current implementation (and probably any sane
implementation) would not send us a delta if we have not provided any
haves. But this does mean that a malicious server could send a client
into an infinite loop.

Pretty unlikely, but should we put some kind of circuit-breaker into the
client to ensure this?

> To resolve this, prefetch all missing REF_DELTA bases before attempting
> to resolve them. This both ensures that all bases are attempted to be
> fetched, and ensures that we make only one request per index-pack
> invocation, and not one request per missing object.

Ah, but now things get more tricky.

You are assuming that the server does not ever send a REF_DELTA unless
the base object is not present in the pack (it would use OFS_DELTA
instead). If we imagine a server which did, then there are two
implications:

  1. We might pre-fetch a full copy of an object that we don't need.
     It's just that it's stored as a delta in the pack which we are
     currently indexing.

  2. If we pre-fetch multiple objects, some of them may be REF_DELTAs
     against each other, leading to an infinite loop.

Off the top of my head, I am pretty sure your assumption holds for all
versions of Git that support delta-base-offset[1]. But that feels a lot
less certain to me. I could imagine an alternate server implementation,
for example, that is gluing together packs and does not try hard to
order the base before the delta, which would require it to use REF_DELTA
instead of OFS_DELTA.

That's sort of contrived, but it does feel like we're introducing a
really subtle requirement on the server here, which might close off
options to us in the future.

Unfortunately, I can't really think of a way for an existing client to
solve this without doing individual fetches for each REF_DELTA as we
encounter a need for it. In fact, even then we may lose if our ordering
is unlucky. E.g., imagine we have two packfile entries whose object ids
(which we don't know yet!) are X and Y, and they are stored as
REF_DELTAs with bases Z and X, respectively. Then either:

  1. We try to resolve X first. We fetch Z on-demand, and reconstruct X.
     We're lucky, because when we try to resolve Y, we see that we
     already have its base X.

  2. We try to resolve Y first. We fetch X on-demand, and reconstruct Y.
     We're unlucky; we then resolve X, but only after on-demand fetching
     Z and reconstructing it do we realize that we already had it.

So really, pre-fetching all of the REF_DELTAs just means we always hit
the unlucky case. But even with careful on-demand fetching, our worst
case is the same (and even worse in terms of latency).

I dunno. Maybe we should just ignore it. It's a fundamental issue with
partial clones that we're going to have to fetch extra junk here anyway,
because what the server _optimally_ would do is not send us deltas
against objects we don't have anyway. We just don't have an efficient
way to tell it what we do have.

If we're willing to modify the format, one thing we _could_ do is have
the server communicate the expectations for each base. I.e., introduce a
new THIN_DELTA type that behaves exactly as a REF_DELTA, but with the
extra 1-bit of knowledge that the server knows it is not including the
base in the pack. I'm not sure how painful that retro-fitting would be.
It would need at least a new capability and options to pack-objects and
index-pack. We might be tight on bits in the packfile type field.

-Peff

[1] Of course there are versions of Git that don't support
    delta-base-offset. They wouldn't support partial clones either, but
    it's possible to fetch into a partially-cloned repository from
    another remote. Given how old and rare such versions are, I think we
    can probably discount it entirely. I'm much more concerned about
    alternate implementations, or trying our hands for future
    pack-objects optimizations.
Junio C Hamano May 16, 2019, 1:43 a.m. UTC | #4
Jeff King <peff@peff.net> writes:

> I agree that the current implementation (and probably any sane
> implementation) would not send us a delta if we have not provided any
> haves. But this does mean that a malicious server could send a client
> into an infinite loop.
>
> Pretty unlikely, but should we put some kind of circuit-breaker into the
> client to ensure this?

That's a pretty good point.  Would it be suffice to have a new
option to tell index-pack that fattens a thin pack and unpack-objects
that expands objects in a small incoming packfile into loose objects
that they are forbidden from on-demand fatching during this invocation,
as it is an error for the packfile they are digesting to depend on a
lazy objects?

> I dunno. Maybe we should just ignore it. It's a fundamental issue with
> partial clones that we're going to have to fetch extra junk here anyway,

Would it be an option not to ask for a thin pack in the first place?

> If we're willing to modify the format, one thing we _could_ do is have
> the server communicate the expectations for each base. I.e., introduce a
> new THIN_DELTA type that behaves exactly as a REF_DELTA, but with the
> extra 1-bit of knowledge that the server knows it is not including the
> base in the pack. I'm not sure how painful that retro-fitting would be.
> It would need at least a new capability and options to pack-objects and
> index-pack. We might be tight on bits in the packfile type field.

The type field is tight, but I wonder how much such a new
representation would help.  Unless the receiving end blindly trusts
what the sender says, there needs to be a logic to detect cyclic
dependencies while following such a delta chain to lazy-fill
promised objects on the receiving end anyway, no?
Jeff King May 16, 2019, 4:04 a.m. UTC | #5
On Thu, May 16, 2019 at 10:43:03AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I agree that the current implementation (and probably any sane
> > implementation) would not send us a delta if we have not provided any
> > haves. But this does mean that a malicious server could send a client
> > into an infinite loop.
> >
> > Pretty unlikely, but should we put some kind of circuit-breaker into the
> > client to ensure this?
> 
> That's a pretty good point.  Would it be suffice to have a new
> option to tell index-pack that fattens a thin pack and unpack-objects
> that expands objects in a small incoming packfile into loose objects
> that they are forbidden from on-demand fatching during this invocation,
> as it is an error for the packfile they are digesting to depend on a
> lazy objects?

Yeah, that's what I was thinking. The parent index-pack sets an
environment variable for "do not recurse to get partials" when it calls
fetch. But that said...

> > I dunno. Maybe we should just ignore it. It's a fundamental issue with
> > partial clones that we're going to have to fetch extra junk here anyway,
> 
> Would it be an option not to ask for a thin pack in the first place?

Yes, that seems much simpler. The flag is really "do not ask for a thin
pack, and do not pass --fix-thin to index-pack". And this recursion
should not kick in when --fix-thin is not in effect (I didn't check
Jonathan's patch to see whether that is the case, but I think that ought
to be the rule regardless of how we decide to address the recursion
issue).

One snag: I don't think unpack-objects understands --fix-thin. It just
looks in the object database and finds either a recently-written object
or one we already have, and it doesn't care about the difference.

For that matter, does unpack-objects need the same treatment here? I
guess not, because it does not disable fetch_if_missing in the same way.
I guess it is already susceptible to the infinite-recursion thing, then.
Or maybe not. We always use index-pack for promisor remotes. So even
though the _first_ fetch which yields REF_DELTA may be from a
non-promisor, any subsequent one would be. So we'd never recurse more
than once. So I think the emergent behavior does what we want. ;)

I worked up some patches a while ago to try to replace unpack-objects
with an index-pack mode that explodes objects (just so we could stop
keeping two almost-the-same code bases around, both of which are
security-critical as they take in untrusted objects, and one of which
clearly gets a lot more attention than the other). But I got hung up a
bit on their strategies for handling base objects. IIRC, the
unpack-objects one does things in a different order that's more
efficient for some cases, and I worried that somebody would care.

I can resurrect that if there's interest (though I do think by the
reasoning above that it's orthogonal to this particular patch series).

> > If we're willing to modify the format, one thing we _could_ do is have
> > the server communicate the expectations for each base. I.e., introduce a
> > new THIN_DELTA type that behaves exactly as a REF_DELTA, but with the
> > extra 1-bit of knowledge that the server knows it is not including the
> > base in the pack. I'm not sure how painful that retro-fitting would be.
> > It would need at least a new capability and options to pack-objects and
> > index-pack. We might be tight on bits in the packfile type field.
> 
> The type field is tight, but I wonder how much such a new
> representation would help.  Unless the receiving end blindly trusts
> what the sender says, there needs to be a logic to detect cyclic
> dependencies while following such a delta chain to lazy-fill
> promised objects on the receiving end anyway, no?

It would let the client know how the server expects it to handle each
delta. The client can then (without trusting the server) say:

  - this is REF_DELTA; the server claims that the base is in this pack,
    so I will not prefetch it. If it turns out not to be, then it is an
    error and I will reject the pack (and _not_ try to fetch it on
    demand)

  - this is THIN_DELTA; I will try to fetch it from the promisor remote,
    and not recurse (because the promisor fetch will not ask for a
    thin-pack). If it also turns out to be in the pack, so be it. We'll
    have two copies and will have wasted the server's bandwidth.

So the client gets to optimize by following the server's directions, but
the worst case for a lying server is not a big deal (it's just more
pessimal).

-Peff
Jonathan Tan May 16, 2019, 6:26 p.m. UTC | #6
> On Tue, May 14, 2019 at 02:10:55PM -0700, Jonathan Tan wrote:
> 
> > Support for lazy fetching should still generally be turned off in
> > index-pack because it is used as part of the lazy fetching process
> > itself (if not, infinite loops may occur), but we do need to fetch the
> > REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that
> > those are REF_DELTA themselves, because we do not send "have" when
> > making such fetches.)
> 
> I agree that the current implementation (and probably any sane
> implementation) would not send us a delta if we have not provided any
> haves. But this does mean that a malicious server could send a client
> into an infinite loop.
> 
> Pretty unlikely, but should we put some kind of circuit-breaker into the
> client to ensure this?

I thought of this - such a server could, but it seems to me that it
would be similar to a server streaming random bytes to us without
stopping (which is already possible).

> > To resolve this, prefetch all missing REF_DELTA bases before attempting
> > to resolve them. This both ensures that all bases are attempted to be
> > fetched, and ensures that we make only one request per index-pack
> > invocation, and not one request per missing object.
> 
> Ah, but now things get more tricky.
> 
> You are assuming that the server does not ever send a REF_DELTA unless
> the base object is not present in the pack (it would use OFS_DELTA
> instead). If we imagine a server which did, then there are two
> implications:
> 
>   1. We might pre-fetch a full copy of an object that we don't need.
>      It's just that it's stored as a delta in the pack which we are
>      currently indexing.
> 
>   2. If we pre-fetch multiple objects, some of them may be REF_DELTAs
>      against each other, leading to an infinite loop.
> 
> Off the top of my head, I am pretty sure your assumption holds for all
> versions of Git that support delta-base-offset[1]. But that feels a lot
> less certain to me. I could imagine an alternate server implementation,
> for example, that is gluing together packs and does not try hard to
> order the base before the delta, which would require it to use REF_DELTA
> instead of OFS_DELTA.

A cursory glance makes me think that REF_DELTA against a base object
also in the pack is already correctly handled. Right before the
invocation of conclude_pack() (which calls fix_unresolved_deltas(), the
function I modified), resolve_deltas() is invoked. The latter invokes
resolve_base() (directly or through threaded_second_pass()) which
invokes find_unresolved_deltas(), which invokes
find_unresolved_deltas_1(), which seems to handle both REF_DELTA and
OFS_DELTA.

Snipping the rest as I don't think we need to solve those if we can
handle REF_DELTA being against an object in a pack, but let me know if
you think that some of the points there still need to be addressed.
Jeff King May 16, 2019, 9:12 p.m. UTC | #7
On Thu, May 16, 2019 at 11:26:46AM -0700, Jonathan Tan wrote:

> > Pretty unlikely, but should we put some kind of circuit-breaker into the
> > client to ensure this?
> 
> I thought of this - such a server could, but it seems to me that it
> would be similar to a server streaming random bytes to us without
> stopping (which is already possible).

True. I was thinking mainly of the infinite-redirection protections we
put in place for https. But I agree that in general, since we don't have
inherent limits on the size of workloads, that servers can already troll
clients pretty hard in a variety of ways.

So I could go either way, though I do think it makes sense for on-demand
fetches for partial clones to avoid asking for thin packs as a general
principle. As a matter of fact, should partial clones _always_ avoid
asking for thin packs?  That would make this issue go away entirely.

Sometimes it would be more efficient (we do not have to get an extra
base object just to resolve the delta we needed) but sometimes worse (if
we did actually have the base, it's a win). Whether it's a win would
depend on the "hit" rate, and I suspect that is heavily dependent on
workload characteristics (what kind of filtering is in use, are we
topping up in a non-partial way, etc).

> > Off the top of my head, I am pretty sure your assumption holds for all
> > versions of Git that support delta-base-offset[1]. But that feels a lot
> > less certain to me. I could imagine an alternate server implementation,
> > for example, that is gluing together packs and does not try hard to
> > order the base before the delta, which would require it to use REF_DELTA
> > instead of OFS_DELTA.
> 
> A cursory glance makes me think that REF_DELTA against a base object
> also in the pack is already correctly handled. Right before the
> invocation of conclude_pack() (which calls fix_unresolved_deltas(), the
> function I modified), resolve_deltas() is invoked. The latter invokes
> resolve_base() (directly or through threaded_second_pass()) which
> invokes find_unresolved_deltas(), which invokes
> find_unresolved_deltas_1(), which seems to handle both REF_DELTA and
> OFS_DELTA.
> 
> Snipping the rest as I don't think we need to solve those if we can
> handle REF_DELTA being against an object in a pack, but let me know if
> you think that some of the points there still need to be addressed.

Right, REF_DELTA is definitely correctly handled currently, and I don't
think that would break with your patch. It's just that your patch would
introduce a bunch of extra traffic as we request bases separately that
are already in the pack.

-Peff
Jonathan Tan May 16, 2019, 9:30 p.m. UTC | #8
> On Thu, May 16, 2019 at 11:26:46AM -0700, Jonathan Tan wrote:
> 
> > > Pretty unlikely, but should we put some kind of circuit-breaker into the
> > > client to ensure this?
> > 
> > I thought of this - such a server could, but it seems to me that it
> > would be similar to a server streaming random bytes to us without
> > stopping (which is already possible).
> 
> True. I was thinking mainly of the infinite-redirection protections we
> put in place for https. But I agree that in general, since we don't have
> inherent limits on the size of workloads, that servers can already troll
> clients pretty hard in a variety of ways.
> 
> So I could go either way, though I do think it makes sense for on-demand
> fetches for partial clones to avoid asking for thin packs as a general
> principle.

This should not be a problem since fetch-pack can already know that
we're doing an on-demand fetch (args->no_dependents), so we should be
able to either plumb a "no-thin-pack" arg in the same way or rename
args->no_dependents to also encompass the no-thin-pack option. But this
can be done separately from this patch set, I think.

> As a matter of fact, should partial clones _always_ avoid
> asking for thin packs?  That would make this issue go away entirely.
> 
> Sometimes it would be more efficient (we do not have to get an extra
> base object just to resolve the delta we needed) but sometimes worse (if
> we did actually have the base, it's a win). Whether it's a win would
> depend on the "hit" rate, and I suspect that is heavily dependent on
> workload characteristics (what kind of filtering is in use, are we
> topping up in a non-partial way, etc).

I think it's best if we still allow servers to serve thin packs. For
example, if we're excluding only large blobs, clients would still want
servers to be able to delta against blobs that they have.

> Right, REF_DELTA is definitely correctly handled currently, and I don't
> think that would break with your patch. It's just that your patch would
> introduce a bunch of extra traffic as we request bases separately that
> are already in the pack.

Ah...I see. For this problem, I think that it can be solved with the
"if (objects[d->obj_no].real_type != OBJ_REF_DELTA)" check that the
existing code uses before calling read_object(). I'll include this in
the next reroll if any other issue comes up.
Jeff King May 16, 2019, 9:42 p.m. UTC | #9
On Thu, May 16, 2019 at 02:30:56PM -0700, Jonathan Tan wrote:

> > So I could go either way, though I do think it makes sense for on-demand
> > fetches for partial clones to avoid asking for thin packs as a general
> > principle.
> 
> This should not be a problem since fetch-pack can already know that
> we're doing an on-demand fetch (args->no_dependents), so we should be
> able to either plumb a "no-thin-pack" arg in the same way or rename
> args->no_dependents to also encompass the no-thin-pack option. But this
> can be done separately from this patch set, I think.

Yeah, I think it can be done separately. Though the two may intermingle
if we want to instruct index-pack that it should not try to pre-fetch if
we did not ask for a thin pack.

> > As a matter of fact, should partial clones _always_ avoid
> > asking for thin packs?  That would make this issue go away entirely.
> > 
> > Sometimes it would be more efficient (we do not have to get an extra
> > base object just to resolve the delta we needed) but sometimes worse (if
> > we did actually have the base, it's a win). Whether it's a win would
> > depend on the "hit" rate, and I suspect that is heavily dependent on
> > workload characteristics (what kind of filtering is in use, are we
> > topping up in a non-partial way, etc).
> 
> I think it's best if we still allow servers to serve thin packs. For
> example, if we're excluding only large blobs, clients would still want
> servers to be able to delta against blobs that they have.

Yes, this is getting into the hit-rate thing I mentioned. You're right
that for a reasonably typical case of "no blobs over 10MB" we'd have a
very high hit rate, and disabling thin packs would almost certainly be a
big loss.

I guess even when we have a "miss", the cost is usually not that high
either. If we get A as a delta against B, then in the non-thin-pack case
we transfer all of A. In the thin-pack case with pre-fetch we transfer
all of B, and then the delta. But the delta is often small enough
compared to the total content that it's not that big a deal either way.
There are pathological cases, of course, but that's already true. :)

So you're right, it's probably still a win to use thin packs when we
can.

> > Right, REF_DELTA is definitely correctly handled currently, and I don't
> > think that would break with your patch. It's just that your patch would
> > introduce a bunch of extra traffic as we request bases separately that
> > are already in the pack.
> 
> Ah...I see. For this problem, I think that it can be solved with the
> "if (objects[d->obj_no].real_type != OBJ_REF_DELTA)" check that the
> existing code uses before calling read_object(). I'll include this in
> the next reroll if any other issue comes up.

I'm confused about this. Aren't we pre-fetching before we've actually
resolved deltas? The base could be in the pack as a true base, and we
might have seen it already then. But it could itself be a delta, and we
wouldn't know we have it until we resolve it (this gets into the
lucky/unlucky ordering thing).

-Peff
Jonathan Tan May 16, 2019, 11:15 p.m. UTC | #10
> > > Right, REF_DELTA is definitely correctly handled currently, and I don't
> > > think that would break with your patch. It's just that your patch would
> > > introduce a bunch of extra traffic as we request bases separately that
> > > are already in the pack.
> > 
> > Ah...I see. For this problem, I think that it can be solved with the
> > "if (objects[d->obj_no].real_type != OBJ_REF_DELTA)" check that the
> > existing code uses before calling read_object(). I'll include this in
> > the next reroll if any other issue comes up.
> 
> I'm confused about this. Aren't we pre-fetching before we've actually
> resolved deltas? The base could be in the pack as a true base, and we
> might have seen it already then. But it could itself be a delta, and we
> wouldn't know we have it until we resolve it (this gets into the
> lucky/unlucky ordering thing).

resolve_deltas(), invoked before any new code introduced in this patch,
has this comment:

> /*
>  * Second pass:
>  * - for all non-delta objects, look if it is used as a base for
>  *   deltas;
>  * - if used as a base, uncompress the object and apply all deltas,
>  *   recursively checking if the resulting object is used as a base
>  *   for some more deltas.
>  */

I haven't seen any code that contradicts this comment. And looking at
the code, for each non-delta object, I think that all deltas are checked
- regardless of whether they appear before or after that non-delta
object. (find_ref_delta() does a binary search from 0 to
nr_ref_deltas, calculated in parse_pack_objects() which happens before
any resolution of deltas.)

And find_unresolved_deltas_1() (called from resolve_deltas() indirectly)
sets the real_type when it resolves a delta, as far as I can tell.

So there is more than one "resolve deltas" step - resolve_deltas() and
then fix_unresolved_deltas(). The pre-fetching happens only during the
latter.
Jeff King May 17, 2019, 1:09 a.m. UTC | #11
On Thu, May 16, 2019 at 04:15:09PM -0700, Jonathan Tan wrote:

> > /*
> >  * Second pass:
> >  * - for all non-delta objects, look if it is used as a base for
> >  *   deltas;
> >  * - if used as a base, uncompress the object and apply all deltas,
> >  *   recursively checking if the resulting object is used as a base
> >  *   for some more deltas.
> >  */
> 
> I haven't seen any code that contradicts this comment. And looking at
> the code, for each non-delta object, I think that all deltas are checked
> - regardless of whether they appear before or after that non-delta
> object. (find_ref_delta() does a binary search from 0 to
> nr_ref_deltas, calculated in parse_pack_objects() which happens before
> any resolution of deltas.)
> 
> And find_unresolved_deltas_1() (called from resolve_deltas() indirectly)
> sets the real_type when it resolves a delta, as far as I can tell.
> 
> So there is more than one "resolve deltas" step - resolve_deltas() and
> then fix_unresolved_deltas(). The pre-fetching happens only during the
> latter.

OK, that is better than I realized. But I think there is still one
interesting case: what happens if a thin delta is used as the base for a
non-thin delta (even one that is OFS_DELTA)?

We cannot handle that second delta until the third pass in
fix_unresolved_deltas(), because we do not realize we have the base
until we resolve the thin delta.

Specifically, I wonder:

  - do we pre-fetch it, not realizing that we will soon have the base?
    That's not ideal, but I think is necessary if we pre-fetch (and is
    still a worst case even if we did single on-demand fetches).

  - will we ever append a presumed-thin base to the pack, only to later
    realize that we already have that object, creating a duplicate
    object in the pack? If so, do we handle this correctly when
    generating the index (I know we've had issues in the past and have
    expressly forbidden duplicates from appearing in the index; even
    having a duplicate in the pack stream itself is non-ideal, though,
    as it screws up things like on-disk size calculations).

    Because of the sorting in fix_unresolved_deltas(), I think this
    could easily be prevented if the non-thin delta is OFS_DELTA (by
    just checking for the base in our already-found list of objects
    before we call read_object_file(). But for REF_DELTA, I think we
    have no way of knowing that appending is the wrong thing (and no
    good way of backing it out afterwards).

-Peff
Jeff King May 17, 2019, 1:22 a.m. UTC | #12
On Thu, May 16, 2019 at 09:09:50PM -0400, Jeff King wrote:

>   - will we ever append a presumed-thin base to the pack, only to later
>     realize that we already have that object, creating a duplicate
>     object in the pack? If so, do we handle this correctly when
>     generating the index (I know we've had issues in the past and have
>     expressly forbidden duplicates from appearing in the index; even
>     having a duplicate in the pack stream itself is non-ideal, though,
>     as it screws up things like on-disk size calculations).
> 
>     Because of the sorting in fix_unresolved_deltas(), I think this
>     could easily be prevented if the non-thin delta is OFS_DELTA (by
>     just checking for the base in our already-found list of objects
>     before we call read_object_file(). But for REF_DELTA, I think we
>     have no way of knowing that appending is the wrong thing (and no
>     good way of backing it out afterwards).

Actually, I think even for REF_DELTA our pack-objects would never
produce such a pack, because IIRC we _always_ put bases in the pack
before their deltas. But that's a pretty subtle thing to depend on. I'm
fine with it if violating it just means things are slightly less
optimal. I'm more worried if it means that index-pack silently produces
a bogus pack.

I think to trigger it you'd have to manually assemble an evil pack as I
described (e.g., using the routines in t/lib-pack.sh). I'm going offline
for a bit, but I may have a go at it later tonight or tomorrow.

-Peff
Jeff King May 17, 2019, 4:39 a.m. UTC | #13
On Thu, May 16, 2019 at 09:22:34PM -0400, Jeff King wrote:

> On Thu, May 16, 2019 at 09:09:50PM -0400, Jeff King wrote:
> 
> >   - will we ever append a presumed-thin base to the pack, only to later
> >     realize that we already have that object, creating a duplicate
> >     object in the pack? If so, do we handle this correctly when
> >     generating the index (I know we've had issues in the past and have
> >     expressly forbidden duplicates from appearing in the index; even
> >     having a duplicate in the pack stream itself is non-ideal, though,
> >     as it screws up things like on-disk size calculations).
> > 
> >     Because of the sorting in fix_unresolved_deltas(), I think this
> >     could easily be prevented if the non-thin delta is OFS_DELTA (by
> >     just checking for the base in our already-found list of objects
> >     before we call read_object_file(). But for REF_DELTA, I think we
> >     have no way of knowing that appending is the wrong thing (and no
> >     good way of backing it out afterwards).
> 
> Actually, I think even for REF_DELTA our pack-objects would never
> produce such a pack, because IIRC we _always_ put bases in the pack
> before their deltas. But that's a pretty subtle thing to depend on. I'm
> fine with it if violating it just means things are slightly less
> optimal. I'm more worried if it means that index-pack silently produces
> a bogus pack.
> 
> I think to trigger it you'd have to manually assemble an evil pack as I
> described (e.g., using the routines in t/lib-pack.sh). I'm going offline
> for a bit, but I may have a go at it later tonight or tomorrow.

OK, doing so wasn't _too_ bad, though I did have to resurrect the
horrible generator patch from [1]. My results are below. But more
importantly, there is good news.

As it turns out, index-pack does not handle these complicated cases at
all! In the final fix_unresolved_deltas(), we are only looking for thin
deltas, and anything that was not yet resolved is assumed to be a thin
object. In many of these cases we _could_ resolve them if we tried
harder. But that is good news for us because it means that these
expectations about delta relationships are already there, and the
pre-fetch done by your patch should always be 100% correct and
efficient.

If you knew this already and were wondering what I've been babbling
about this whole time, then apologies for the noise. If not, then I hope
you enjoyed this deep dive into the inner workings of our delta
resolution. :) This was a part of index-pack I hadn't really had to
touch before, so I learned a lot.

Here's the script I used to test this (the comments explaining what's
going on were obviously written _after_ I ran it and figured out why it
didn't work; I had initially expected the first one to pass).

I don't really think it's worth carrying these tests in our tree. I'm
mostly just showing my work.

-- >8 --
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
index c4d907a450..43785335e0 100644
--- a/t/lib-pack.sh
+++ b/t/lib-pack.sh
@@ -77,6 +77,18 @@ pack_obj () {
 			;;
 		esac
 		;;
+
+	# blob containing "\7\1"
+	0231abe3332fca6abc6a7e0f8000473692ce8c83)
+		case "$2" in
+		01d7713666f4de822776c7622c10f1b07de280dc)
+			printf '\165\2\61\253\343\63\57\312\152\274\152\176' &&
+			printf '\17\200\0\107\66\222\316\214\203\170\234' &&
+			printf '\143\142\142\142\147\0\0\0\53\0\16'
+			return
+			;;
+		esac
+		;;
 	esac
 
 	# If it's not a delta, we can convince pack-objects to generate a pack
diff --git a/t/t1234-foo.sh b/t/t1234-foo.sh
new file mode 100755
index 0000000000..22f1e85f81
--- /dev/null
+++ b/t/t1234-foo.sh
@@ -0,0 +1,61 @@
+#!/bin/sh
+
+test_description='delta resolution torture tests'
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
+
+# blobs that lib-pack.sh knows about
+A=0231abe3332fca6abc6a7e0f8000473692ce8c83
+B=01d7713666f4de822776c7622c10f1b07de280dc
+C=e68fe8129b546b101aee9510c5328e7f21ca1d18
+
+test_expect_success 'create local copy of object C' '
+	# contents from lib-pack.sh
+	printf "\\7\\76" | git hash-object -w --stdin
+'
+
+# Create a pack with a delta chain A->B->C. The important things are:
+#
+#   - these are all REF_DELTA, which is what lib-pack produces
+#
+#   - the delta from $C is thin; the receiver should have that object
+#
+#   - the base (if present) comes _before_ the delta in the file (we assume
+#     that the final delta resolution uses pack order to break ties)
+test_expect_success 'create thin pack' '
+	{
+		pack_header 2 &&
+		pack_obj $B $C &&
+		pack_obj $A $B
+	} >ba.pack &&
+	pack_trailer ba.pack
+'
+
+# In theory this could work if we noticed that we had just generated $B.
+# However, fix_unresolved_deltas() never looks in the pack; it assumes anything
+# we found at that point is thin, and looks only at our regular object
+# database for the base.
+test_expect_failure 'index ba.pack' '
+	git index-pack --fix-thin --stdin <ba.pack
+'
+
+# Same pack, but order reversed; the base comes after the delta.
+test_expect_success 'create thin pack' '
+	{
+		pack_header 2 &&
+		pack_obj $A $B &&
+		pack_obj $B $C
+	} >ab.pack &&
+	pack_trailer ab.pack
+'
+
+# This one is even less likely to work, because when we are processing $A and
+# look for its base $B, we would not yet have processed $B. So we know nothing
+# about it.  It would take two passes (and in the general case, up to O(n)
+# passes) to resolve everything (assuming we even start looking in the current
+# pack for objects, as ba.pack would require).
+test_expect_failure 'index ab.pack' '
+	git index-pack --fix-thin --stdin <ab.pack
+'
+
+test_done
Jeff King May 17, 2019, 4:42 a.m. UTC | #14
On Fri, May 17, 2019 at 12:39:39AM -0400, Jeff King wrote:

> > I think to trigger it you'd have to manually assemble an evil pack as I
> > described (e.g., using the routines in t/lib-pack.sh). I'm going offline
> > for a bit, but I may have a go at it later tonight or tomorrow.
> 
> OK, doing so wasn't _too_ bad, though I did have to resurrect the
> horrible generator patch from [1]. My results are below. But more
> importantly, there is good news.

Oh, I forgot my [1]. It was this message:

  https://public-inbox.org/git/20130823182409.GA30130@sigill.intra.peff.net/

which explained how I hacked up pack-objects to generate the binary goo
in lib-pack.sh. Here's my forward-ported version of it, in case we need
it again.

I do think it would be useful for debugging and experimenting to have a
mode for pack-objects that takes in a list of instructions (add object
X, make it a delta against object Y) that skips its usual ordering and
delta constraints. This isn't it, but it's enough to hackily get what
you need. ;)

---
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 9f424aabab..87955d9404 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3234,6 +3234,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	struct argv_array rp = ARGV_ARRAY_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	int rev_list_index = 0;
+	int magic = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
@@ -3321,6 +3322,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("do not pack objects in promisor packfiles")),
 		OPT_BOOL(0, "delta-islands", &use_delta_islands,
 			 N_("respect islands during delta compression")),
+		OPT_BOOL(0, "magic", &magic, "make deltas"),
 		OPT_END(),
 	};
 
@@ -3337,6 +3339,34 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
 			     pack_usage, 0);
 
+	if (magic) {
+		struct object_id oid;
+		struct delta_index *index;
+		void *src, *trg, *delta;
+		enum object_type src_type, trg_type;
+		unsigned long src_size, trg_size, delta_size, z_delta_size;
+		unsigned char header[10];
+		unsigned long header_len;
+
+		get_oid(argv[0], &oid);
+		trg = read_object_file(&oid, &trg_type, &trg_size);
+
+		get_oid(argv[1], &oid);
+		src = read_object_file(&oid, &src_type, &src_size);
+
+		index = create_delta_index(src, src_size);
+		delta = create_delta(index, trg, trg_size, &delta_size, 8192);
+
+		z_delta_size = do_compress(&delta, delta_size);
+		header_len = encode_in_pack_object_header(header, sizeof(header),
+							  OBJ_REF_DELTA,
+							  delta_size);
+		fwrite(header, 1, header_len, stdout);
+		fwrite(oid.hash, 1, the_hash_algo->rawsz, stdout);
+		fwrite(delta, 1, z_delta_size, stdout);
+		return 0;
+	}
+
 	if (argc) {
 		base_name = argv[0];
 		argc--;
Duy Nguyen May 17, 2019, 7:20 a.m. UTC | #15
On Fri, May 17, 2019 at 12:35 PM Jeff King <peff@peff.net> wrote:
> As it turns out, index-pack does not handle these complicated cases at
> all! In the final fix_unresolved_deltas(), we are only looking for thin
> deltas, and anything that was not yet resolved is assumed to be a thin
> object. In many of these cases we _could_ resolve them if we tried
> harder. But that is good news for us because it means that these
> expectations about delta relationships are already there, and the
> pre-fetch done by your patch should always be 100% correct and
> efficient.

Is it worth keeping some of these notes in the "third pass" comment
block in index-pack.c to help future readers?
Jeff King May 17, 2019, 8:55 a.m. UTC | #16
On Fri, May 17, 2019 at 02:20:42PM +0700, Duy Nguyen wrote:

> On Fri, May 17, 2019 at 12:35 PM Jeff King <peff@peff.net> wrote:
> > As it turns out, index-pack does not handle these complicated cases at
> > all! In the final fix_unresolved_deltas(), we are only looking for thin
> > deltas, and anything that was not yet resolved is assumed to be a thin
> > object. In many of these cases we _could_ resolve them if we tried
> > harder. But that is good news for us because it means that these
> > expectations about delta relationships are already there, and the
> > pre-fetch done by your patch should always be 100% correct and
> > efficient.
> 
> Is it worth keeping some of these notes in the "third pass" comment
> block in index-pack.c to help future readers?

Perhaps. I started on the patch below, but I had trouble in the commit
message. I couldn't find the part of the code that explains why we would
never produce this combination, though empirically we do not.

-- >8 --
Subject: [PATCH] index-pack: describe an implication of our thin resolving

After digging into the delta resolution code, I discovered a surprising
(to me, anyway) implication of our strategy: we could never find a
non-thin delta with a thin delta as its base. This is OK because
pack-objects will never produce such a combination, because....?

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/index-pack.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index ccf4eb7e9b..f40f4560d4 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1224,6 +1224,13 @@ static void resolve_deltas(void)
  * Third pass:
  * - append objects to convert thin pack to full pack if required
  * - write the final pack hash
+ *
+ * Note that we assume all deltas at this phase are thin. We take only a
+ * single pass over the unresolved objects, and we look for bases only
+ * in our set of already-existing objects, _not_ other objects within this
+ * pack. This means that we would never find an object A stored as a delta
+ * against another object B in this pack, when B is a thin delta against a base
+ * not in the pack.
  */
 static void fix_unresolved_deltas(struct hashfile *f);
 static void conclude_pack(int fix_thin_pack, const char *curr_pack, unsigned char *pack_hash)
Johannes Schindelin May 17, 2019, 6:33 p.m. UTC | #17
Hi Jonathan,

On Wed, 15 May 2019, Jonathan Tan wrote:

> > > To resolve this, prefetch all missing REF_DELTA bases before attempting
> > > to resolve them. This both ensures that all bases are attempted to be
> > > fetched, and ensures that we make only one request per index-pack
> > > invocation, and not one request per missing object.
> >
> > Hmm. I wonder whether this can lead to *really* undesirable behavior, e.g.
> > with deep delta chains. The client would possibly have to fetch the
> > REF_DELTA object, but that would also be delivered in a thin pack with
> > *another* REF_DELTA object, and the same over and over again, with plenty
> > of round trips that kill performance really well.
>
> When the client fetches the REF_DELTA base, it won't be a REF_DELTA
> object itself because Git makes these fetches without any "have" lines,
> so the server doesn't know anything to delta against. Admittedly, this
> is just due how to we implemented it - if later we find a way to
> optimize the lazy fetches by adding "have", then we'll have to revisit
> this.

Ah! I *think* I understand this better now. Thank you.

> Quoting from the commit message:
>
> > > (When fetching REF_DELTA bases, it is unlikely that
> > > those are REF_DELTA themselves, because we do not send "have" when
> > > making such fetches.)
>
> I tried to address this point with this sentence in the commit message.
> If you think that this should be addressed more clearly in the commit
> message, let me know if you have any suggestions.

I totally read over this part of the commit message, apparently. My bad.

Sorry for the noise!
Dscho
Duy Nguyen May 18, 2019, 11:39 a.m. UTC | #18
On Fri, May 17, 2019 at 3:55 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, May 17, 2019 at 02:20:42PM +0700, Duy Nguyen wrote:
>
> > On Fri, May 17, 2019 at 12:35 PM Jeff King <peff@peff.net> wrote:
> > > As it turns out, index-pack does not handle these complicated cases at
> > > all! In the final fix_unresolved_deltas(), we are only looking for thin
> > > deltas, and anything that was not yet resolved is assumed to be a thin
> > > object. In many of these cases we _could_ resolve them if we tried
> > > harder. But that is good news for us because it means that these
> > > expectations about delta relationships are already there, and the
> > > pre-fetch done by your patch should always be 100% correct and
> > > efficient.
> >
> > Is it worth keeping some of these notes in the "third pass" comment
> > block in index-pack.c to help future readers?
>
> Perhaps. I started on the patch below, but I had trouble in the commit
> message. I couldn't find the part of the code that explains why we would
> never produce this combination, though empirically we do not.

That still has some value even if your commit ends up with a question
mark. There's not much to dig out of 636171cb80 (make index-pack able
to complete thin packs., 2006-10-25). Adding Nico, maybe he still
remembers...

> -- >8 --
> Subject: [PATCH] index-pack: describe an implication of our thin resolving
>
> After digging into the delta resolution code, I discovered a surprising
> (to me, anyway) implication of our strategy: we could never find a
> non-thin delta with a thin delta as its base. This is OK because
> pack-objects will never produce such a combination, because....?
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  builtin/index-pack.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index ccf4eb7e9b..f40f4560d4 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1224,6 +1224,13 @@ static void resolve_deltas(void)
>   * Third pass:
>   * - append objects to convert thin pack to full pack if required
>   * - write the final pack hash
> + *
> + * Note that we assume all deltas at this phase are thin. We take only a
> + * single pass over the unresolved objects, and we look for bases only
> + * in our set of already-existing objects, _not_ other objects within this
> + * pack. This means that we would never find an object A stored as a delta
> + * against another object B in this pack, when B is a thin delta against a base
> + * not in the pack.
>   */
>  static void fix_unresolved_deltas(struct hashfile *f);
>  static void conclude_pack(int fix_thin_pack, const char *curr_pack, unsigned char *pack_hash)
> --
> 2.22.0.rc0.544.g1eb4087842
>
Nicolas Pitre May 20, 2019, 11:04 p.m. UTC | #19
On Sat, 18 May 2019, Duy Nguyen wrote:

> On Fri, May 17, 2019 at 3:55 PM Jeff King <peff@peff.net> wrote:
> >
> > On Fri, May 17, 2019 at 02:20:42PM +0700, Duy Nguyen wrote:
> >
> > > On Fri, May 17, 2019 at 12:35 PM Jeff King <peff@peff.net> wrote:
> > > > As it turns out, index-pack does not handle these complicated cases at
> > > > all! In the final fix_unresolved_deltas(), we are only looking for thin
> > > > deltas, and anything that was not yet resolved is assumed to be a thin
> > > > object. In many of these cases we _could_ resolve them if we tried
> > > > harder. But that is good news for us because it means that these
> > > > expectations about delta relationships are already there, and the
> > > > pre-fetch done by your patch should always be 100% correct and
> > > > efficient.
> > >
> > > Is it worth keeping some of these notes in the "third pass" comment
> > > block in index-pack.c to help future readers?
> >
> > Perhaps. I started on the patch below, but I had trouble in the commit
> > message. I couldn't find the part of the code that explains why we would
> > never produce this combination, though empirically we do not.

Good question indeed.

> That still has some value even if your commit ends up with a question
> mark. There's not much to dig out of 636171cb80 (make index-pack able
> to complete thin packs., 2006-10-25). Adding Nico, maybe he still
> remembers...

What about this comment in fix_unresolved_deltas():

        /*
         * Since many unresolved deltas may well be themselves base objects
         * for more unresolved deltas, we really want to include the
         * smallest number of base objects that would cover as much delta
         * as possible by picking the
         * trunc deltas first, allowing for other deltas to resolve without
         * additional base objects.  Since most base objects are to be found
         * before deltas depending on them, a good heuristic is to start
         * resolving deltas in the same order as their position in the pack.
         */

Doesn't that cover it?

In pack-objects, another comment says:

 * Depth value does not matter - find_deltas() will
 * never consider reused delta as the base object to
 * deltify other objects against, in order to avoid
 * circular deltas.

Sorry if I'm not of any help here. Although I used to have my brain 
wrapped around this code pretty tightly, it's been quite a while, and 
the code did change as well since then.


Nicolas
Jeff King May 21, 2019, 9:20 p.m. UTC | #20
On Mon, May 20, 2019 at 07:04:14PM -0400, Nicolas Pitre wrote:

> > That still has some value even if your commit ends up with a question
> > mark. There's not much to dig out of 636171cb80 (make index-pack able
> > to complete thin packs., 2006-10-25). Adding Nico, maybe he still
> > remembers...
> 
> What about this comment in fix_unresolved_deltas():
> 
>         /*
>          * Since many unresolved deltas may well be themselves base objects
>          * for more unresolved deltas, we really want to include the
>          * smallest number of base objects that would cover as much delta
>          * as possible by picking the
>          * trunc deltas first, allowing for other deltas to resolve without
>          * additional base objects.  Since most base objects are to be found
>          * before deltas depending on them, a good heuristic is to start
>          * resolving deltas in the same order as their position in the pack.
>          */
> 
> Doesn't that cover it?

Hmm. It does help, but only because my earlier comments were actually
wrong. I was claiming that index-pack would not be able to resolve "A"
from the delta chain when A is a regular delta, B is a thin delta, and C
is not in the pack.

Because I did not see how we would ever find base "B" in that third pass
of fix_unresolved_deltas(), because we look only for objects we already
have on disk.

But I think the trick that I was missing is that if we see "B" first,
we'll resolve it against C that we already have, but then we'll _also_
look for children that this now enables us to resolve. So we'd resolve
"A" at that point, and then when we later hit "A" in the loop from
fix_unresolved_deltas(), we skip it because of this:

   if (objects[d->obj_no].real_type != OBJ_REF_DELTA)
	continue;

So this situation _can_ happen, and we do handle it properly. I don't
know why the tests I showed in [1] didn't work. Obviously I botched
something.

I'm still not quite sure what would happen if we see "A" first (i.e.,
the delta is physically in the pack before its base, "B") and both are
REF_DELTA. Right now we'd skip over A, knowing that we don't have B. And
then when we get to B, we'd correctly resolve A on top of it (and if we
have no such B, we'd eventually complain "hey, there are still some
unresolved deltas").

But what happens if we _do_ have "B", but we just weren't able to tell
the sender for some reason (e.g., it's unreferenced). We'd add a new
copy of "B" to the pack while resolving "A", as part of --fix-thin. And
then when we resolve "B", we'd get _another_ copy of "B" in the pack,
and our result would have a duplicate.

I don't think this happens in practice because we'd generally use
OFS_DELTA in the first place these days. And even for REF_DELTA, I think
we prefer to put bases before their deltas (i.e., we'd always see "B"
before "A"). But I think if we ever _did_ see it (alternate
implementation? malicious packfile?) we'd generate duplicates. I _think_
that would then cause us to barf due to the duplicate check from
68be2fea50 (receive-pack, fetch-pack: reject bogus pack that records
objects twice, 2011-11-16).

And that's true today, even without Jonathan's on-demand fetching patch.
So I don't think that materially changes the requirements for
correctness. It does mean that we might fetch (but not use!) a thin base
we don't need, but only if the sender uses REF_DELTA for non-thin
deltas, which we wouldn't normally do.

-Peff

[1] https://public-inbox.org/git/20190517043939.GA12063@sigill.intra.peff.net/
Jonathan Nieder June 3, 2019, 10:23 p.m. UTC | #21
Jonathan Tan wrote:

> When fetching, the client sends "have" commit IDs indicating that the
> server does not need to send any object referenced by those commits,
> reducing network I/O. When the client is a partial clone, the client
> still sends "have"s in this way, even if it does not have every object
> referenced by a commit it sent as "have".
>
> If a server omits such an object, it is fine: the client could lazily
> fetch that object before this fetch, and it can still do so after.
>
> The issue is when the server sends a thin pack containing an object that
> is a REF_DELTA against such a missing object: index-pack fails to fix
> the thin pack. When support for lazily fetching missing objects was
> added in 8b4c0103a9 ("sha1_file: support lazily fetching missing
> objects", 2017-12-08), support in index-pack was turned off in the
> belief that it accesses the repo only to do hash collision checks.
> However, this is not true: it also needs to access the repo to resolve
> REF_DELTA bases.
[...]
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/index-pack.c     | 26 +++++++++++++++--
>  t/t5616-partial-clone.sh | 61 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+), 2 deletions(-)

Thanks much.

This bugfix has been working well at $DAYJOB:
Tested-by: Jonathan Nieder <jrnieder@gmail.com>

Is it something that could be in 2.22.0 or 2.22.1?
diff mbox series

Patch

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index ccf4eb7e9b..0d55f73b0b 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -14,6 +14,7 @@ 
 #include "thread-utils.h"
 #include "packfile.h"
 #include "object-store.h"
+#include "fetch-object.h"
 
 static const char index_pack_usage[] =
 "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
@@ -1351,6 +1352,25 @@  static void fix_unresolved_deltas(struct hashfile *f)
 		sorted_by_pos[i] = &ref_deltas[i];
 	QSORT(sorted_by_pos, nr_ref_deltas, delta_pos_compare);
 
+	if (repository_format_partial_clone) {
+		/*
+		 * Prefetch the delta bases.
+		 */
+		struct oid_array to_fetch = OID_ARRAY_INIT;
+		for (i = 0; i < nr_ref_deltas; i++) {
+			struct ref_delta_entry *d = sorted_by_pos[i];
+			if (!oid_object_info_extended(the_repository, &d->oid,
+						      NULL,
+						      OBJECT_INFO_FOR_PREFETCH))
+				continue;
+			oid_array_append(&to_fetch, &d->oid);
+		}
+		if (to_fetch.nr)
+			fetch_objects(repository_format_partial_clone,
+				      to_fetch.oid, to_fetch.nr);
+		oid_array_clear(&to_fetch);
+	}
+
 	for (i = 0; i < nr_ref_deltas; i++) {
 		struct ref_delta_entry *d = sorted_by_pos[i];
 		enum object_type type;
@@ -1650,8 +1670,10 @@  int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	int report_end_of_input = 0;
 
 	/*
-	 * index-pack never needs to fetch missing objects, since it only
-	 * accesses the repo to do hash collision checks
+	 * index-pack never needs to fetch missing objects except when
+	 * REF_DELTA bases are missing (which are explicitly handled). It only
+	 * accesses the repo to do hash collision checks and to check which
+	 * REF_DELTA bases need to be fetched.
 	 */
 	fetch_if_missing = 0;
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 7cc0c71556..f1baf83502 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -339,4 +339,65 @@  test_expect_success 'when partial cloning, tolerate server not sending target of
 	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
 '
 
+test_expect_success 'tolerate server sending REF_DELTA against missing promisor objects' '
+	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+	rm -rf "$SERVER" repo &&
+	test_create_repo "$SERVER" &&
+	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
+	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
+
+	# Create a commit with a blob to be used as a delta base.
+	for i in $(test_seq 10)
+	do
+		echo "this is a line" >>"$SERVER/foo.txt"
+	done &&
+	git -C "$SERVER" add foo.txt &&
+	git -C "$SERVER" commit -m bar &&
+	git -C "$SERVER" rev-parse HEAD:foo.txt >deltabase &&
+
+	git -c protocol.version=2 clone --no-checkout \
+		--filter=blob:none $HTTPD_URL/one_time_sed/server repo &&
+
+	# Sanity check to ensure that the client does not have that blob.
+	git -C repo rev-list --objects --exclude-promisor-objects \
+		-- $(cat deltabase) >objlist &&
+	test_line_count = 0 objlist &&
+
+	# Another commit. This commit will be fetched by the client.
+	echo "abcdefghijklmnopqrstuvwxyz" >>"$SERVER/foo.txt" &&
+	git -C "$SERVER" add foo.txt &&
+	git -C "$SERVER" commit -m baz &&
+
+	# Pack a thin pack containing, among other things, HEAD:foo.txt
+	# delta-ed against HEAD^:foo.txt.
+	printf "%s\n--not\n%s\n" \
+		$(git -C "$SERVER" rev-parse HEAD) \
+		$(git -C "$SERVER" rev-parse HEAD^) |
+		git -C "$SERVER" pack-objects --thin --stdout >thin.pack &&
+
+	# Ensure that the pack contains one delta against HEAD^:foo.txt. Since
+	# the delta contains at least 26 novel characters, the size cannot be
+	# contained in 4 bits, so the object header will take up 2 bytes. The
+	# most significant nybble of the first byte is 0b1111 (0b1 to indicate
+	# that the header continues, and 0b111 to indicate REF_DELTA), followed
+	# by any 3 nybbles, then the OID of the delta base.
+	git -C "$SERVER" rev-parse HEAD^:foo.txt >deltabase &&
+	printf "f.,..%s" $(intersperse "," <deltabase) >want &&
+	hex_unpack <thin.pack | intersperse "," >have &&
+	grep $(cat want) have &&
+
+	replace_packfile thin.pack &&
+
+	# Use protocol v2 because the sed command looks for the "packfile"
+	# section header.
+	test_config -C "$SERVER" protocol.version 2 &&
+
+	# Fetch the thin pack and ensure that index-pack is able to handle the
+	# REF_DELTA object with a missing promisor delta base.
+	git -C repo -c protocol.version=2 fetch &&
+
+	# Ensure that the one-time-sed script was used.
+	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
+'
+
 test_done