[v2] send-pack: never fetch when checking exclusions
diff mbox series

Message ID 20191008183739.194714-1-jonathantanmy@google.com
State New
Headers show
Series
  • [v2] send-pack: never fetch when checking exclusions
Related show

Commit Message

Jonathan Tan Oct. 8, 2019, 6:37 p.m. UTC
When building the packfile to be sent, send_pack() is given a list of
remote refs to be used as exclusions. For each ref, it first checks if
the ref exists locally, and if it does, passes it with a "^" prefix to
pack-objects. However, in a partial clone, the check may trigger a lazy
fetch.

The additional commit ancestry information obtained during such fetches
may show that certain objects that would have been sent are already
known to the server, resulting in a smaller pack being sent. But this is
at the cost of fetching from many possibly unrelated refs, and the lazy
fetches do not help at all in the typical case where the client is
up-to-date with the upstream of the branch being pushed.

Ensure that these lazy fetches do not occur.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
> For example, would this change mean that the resulting pack may
> include stuff that are reachable from the (missing) negative objects
> that would not otherwise have to be sent if these objects were
> available (or made available by the lazy fetching), and we are
> making a trade-off to send possibly more in order for not fetching?
> Have we laid enough on the table to help readers if such a trade-off
> (if we are making one, that is) strikes the right balance?

Thanks for your comments. I've expanded the commit message.
---
 send-pack.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jeff King Oct. 11, 2019, 6:12 a.m. UTC | #1
On Tue, Oct 08, 2019 at 11:37:39AM -0700, Jonathan Tan wrote:

> When building the packfile to be sent, send_pack() is given a list of
> remote refs to be used as exclusions. For each ref, it first checks if
> the ref exists locally, and if it does, passes it with a "^" prefix to
> pack-objects. However, in a partial clone, the check may trigger a lazy
> fetch.
> 
> The additional commit ancestry information obtained during such fetches
> may show that certain objects that would have been sent are already
> known to the server, resulting in a smaller pack being sent. But this is
> at the cost of fetching from many possibly unrelated refs, and the lazy
> fetches do not help at all in the typical case where the client is
> up-to-date with the upstream of the branch being pushed.
> 
> Ensure that these lazy fetches do not occur.

That makes sense. For similar reasons, should we be using
OBJECT_INFO_QUICK here? If the other side has a bunch of ref tips that
we don't have, we'll end up re-scanning the pack directory over and over
(which is _usually_ pretty quick, but can be slow if you have a lot of
packs locally). And it's OK if we racily miss out on an exclusion due to
somebody else repacking simultaneously.

-Peff
Derrick Stolee Oct. 11, 2019, 12:31 p.m. UTC | #2
On 10/11/2019 2:12 AM, Jeff King wrote:
> On Tue, Oct 08, 2019 at 11:37:39AM -0700, Jonathan Tan wrote:
> 
>> When building the packfile to be sent, send_pack() is given a list of
>> remote refs to be used as exclusions. For each ref, it first checks if
>> the ref exists locally, and if it does, passes it with a "^" prefix to
>> pack-objects. However, in a partial clone, the check may trigger a lazy
>> fetch.
>>
>> The additional commit ancestry information obtained during such fetches
>> may show that certain objects that would have been sent are already
>> known to the server, resulting in a smaller pack being sent. But this is
>> at the cost of fetching from many possibly unrelated refs, and the lazy
>> fetches do not help at all in the typical case where the client is
>> up-to-date with the upstream of the branch being pushed.
>>
>> Ensure that these lazy fetches do not occur.
> 
> That makes sense. For similar reasons, should we be using
> OBJECT_INFO_QUICK here? If the other side has a bunch of ref tips that
> we don't have, we'll end up re-scanning the pack directory over and over
> (which is _usually_ pretty quick, but can be slow if you have a lot of
> packs locally). And it's OK if we racily miss out on an exclusion due to
> somebody else repacking simultaneously.

That's a good idea. We can hint to the object store that we don't expect
misses to be due to a concurrent repack, so we don't want to reprepare
pack-files.

-Stolee
Jeff King Oct. 11, 2019, 4:15 p.m. UTC | #3
On Fri, Oct 11, 2019 at 08:31:30AM -0400, Derrick Stolee wrote:

> >> Ensure that these lazy fetches do not occur.
> > 
> > That makes sense. For similar reasons, should we be using
> > OBJECT_INFO_QUICK here? If the other side has a bunch of ref tips that
> > we don't have, we'll end up re-scanning the pack directory over and over
> > (which is _usually_ pretty quick, but can be slow if you have a lot of
> > packs locally). And it's OK if we racily miss out on an exclusion due to
> > somebody else repacking simultaneously.
> 
> That's a good idea. We can hint to the object store that we don't expect
> misses to be due to a concurrent repack, so we don't want to reprepare
> pack-files.

As a general rule (and why I'm raising this issue in reply to Jonathan's
patch), I think most or all sites that want OBJECT_INFO_QUICK will want
SKIP_FETCH_OBJECT as well, and vice versa. The reasoning is generally
the same:

  - it's OK to racily have a false negative (we'll still be correct, but
    possibly a little less optimal)

  - it's expected and normal to be missing the object, so spending time
    double-checking the pack store wastes measurable time in real-world
    cases

-Peff
Jonathan Tan Oct. 11, 2019, 10:08 p.m. UTC | #4
> As a general rule (and why I'm raising this issue in reply to Jonathan's
> patch), I think most or all sites that want OBJECT_INFO_QUICK will want
> SKIP_FETCH_OBJECT as well, and vice versa. The reasoning is generally
> the same:
> 
>   - it's OK to racily have a false negative (we'll still be correct, but
>     possibly a little less optimal)
> 
>   - it's expected and normal to be missing the object, so spending time
>     double-checking the pack store wastes measurable time in real-world
>     cases

I took a look on "next" and it's true for these reasons in most cases
but not all.

QUICK implies SKIP_FETCH_OBJECT:

  fetch-pack.c: Run with fetch_if_missing=0 (from builtin/fetch.c,
  builtin/fetch-pack.c, or through a lazy fetch) so OK.
  
  builtin/index-pack.c: Run with fetch_if_missing=0, so OK.
  
  builtin/fetch.c: Run with fetch_if_missing=0, so OK.
  
  object-store.h, sha1-file.c: Definition and implementation of this
  flag.

Everything is OK here. Now, SKIP_FETCH_OBJECT implies QUICK:

  cache-tree.c: I added this recently in f981ec18cf ("cache-tree: do not
  lazy-fetch tentative tree", 2019-09-09). No problem with a false
  negative, since we know how to reconstruct the tree. OK.
  
  object-store.h, sha1-file.c: Definition and implementation of this
  flag.
  
  send-pack.c: This patch (which is already in "next"). If we have a
  false negative, we might accidentally send more than we need. But that
  is not too bad.
  
  promisor-remote.c: This is the slightly tricky one. We use this
  information to determine if we got our lazily-fetched object from the
  most recent lazy fetch, or if we should continue attempting to fetch the
  given object from other promisor remotes; so this information is
  important. However, adding QUICK doesn't lose us anything because the
  lack of QUICK only helps us when there is another process packing
  loose objects: if we got our object, our object will be in a pack
  (because of the way the fetch is implemented - in particular, we need
  a pack because we need the ".promisor" file).

So everything is OK except for promisor-remote.c, but even that is OK
for another reason.

Having said that, perhaps we should consider promisor-remote.c as
low-level code and expect it to know that objects are fetched into a
packfile (as opposed to loose objects), so it can safely use QUICK
(which is documented as checking packed after packed and loose). If no
one disagrees, I can make such a patch after jt/push-avoid-lazy-fetch is
merged to master (as is the plan, according to What's Cooking [1]).

[1] https://public-inbox.org/git/xmqq8sprhgzc.fsf@gitster-ct.c.googlers.com/
Junio C Hamano Oct. 12, 2019, 12:47 a.m. UTC | #5
Jeff King <peff@peff.net> writes:

> As a general rule (and why I'm raising this issue in reply to Jonathan's
> patch), I think most or all sites that want OBJECT_INFO_QUICK will want
> SKIP_FETCH_OBJECT as well, and vice versa. The reasoning is generally
> the same:
>
>   - it's OK to racily have a false negative (we'll still be correct, but
>     possibly a little less optimal)
>
>   - it's expected and normal to be missing the object, so spending time
>     double-checking the pack store wastes measurable time in real-world
>     cases

31f5256c ("sha1-file: split OBJECT_INFO_FOR_PREFETCH", 2019-05-28)
separated SKIP_FETCH_OBJECT out of FOR_PREFETCH, the latter of which
was and is SKIP_FETCH and QUICK combined.  Use SKIP_FETCH_OBJECT
alone may need to be re-examined and discouraged?
Jeff King Oct. 17, 2019, 6:10 a.m. UTC | #6
On Fri, Oct 11, 2019 at 03:08:22PM -0700, Jonathan Tan wrote:

> > As a general rule (and why I'm raising this issue in reply to Jonathan's
> > patch), I think most or all sites that want OBJECT_INFO_QUICK will want
> > SKIP_FETCH_OBJECT as well, and vice versa. The reasoning is generally
> > the same:
> > 
> >   - it's OK to racily have a false negative (we'll still be correct, but
> >     possibly a little less optimal)
> > 
> >   - it's expected and normal to be missing the object, so spending time
> >     double-checking the pack store wastes measurable time in real-world
> >     cases
> 
> I took a look on "next" and it's true for these reasons in most cases
> but not all.

Thanks for digging into this.

> QUICK implies SKIP_FETCH_OBJECT:
> 
>   fetch-pack.c: Run with fetch_if_missing=0 (from builtin/fetch.c,
>   builtin/fetch-pack.c, or through a lazy fetch) so OK.
>   
>   builtin/index-pack.c: Run with fetch_if_missing=0, so OK.
>   
>   builtin/fetch.c: Run with fetch_if_missing=0, so OK.
>   
>   object-store.h, sha1-file.c: Definition and implementation of this
>   flag.

Right, I think going in this direction is pretty simple. Having been
marked with QUICK, they hit both of my points from above. And if we want
to avoid re-scanning the pack directory because of cost, we _definitely_
want to avoid making an expensive network call.

> Everything is OK here. Now, SKIP_FETCH_OBJECT implies QUICK:
> 
>   cache-tree.c: I added this recently in f981ec18cf ("cache-tree: do not
>   lazy-fetch tentative tree", 2019-09-09). No problem with a false
>   negative, since we know how to reconstruct the tree. OK.
> [...]
>   send-pack.c: This patch (which is already in "next"). If we have a
>   false negative, we might accidentally send more than we need. But that
>   is not too bad.

Yeah, I think both of these could be QUICK.

>   promisor-remote.c: This is the slightly tricky one. We use this
>   information to determine if we got our lazily-fetched object from the
>   most recent lazy fetch, or if we should continue attempting to fetch the
>   given object from other promisor remotes; so this information is
>   important. However, adding QUICK doesn't lose us anything because the
>   lack of QUICK only helps us when there is another process packing
>   loose objects: if we got our object, our object will be in a pack
>   (because of the way the fetch is implemented - in particular, we need
>   a pack because we need the ".promisor" file).
> 
> So everything is OK except for promisor-remote.c, but even that is OK
> for another reason.

Yeah, though I wouldn't be sad to see that use a separate flag, since it
really is about promisor logic.

That implies to me maybe we should be using QUICK more aggressively, and
QUICK should auto-imply SKIP_FETCH_OBJECT.

> Having said that, perhaps we should consider promisor-remote.c as
> low-level code and expect it to know that objects are fetched into a
> packfile (as opposed to loose objects), so it can safely use QUICK
> (which is documented as checking packed after packed and loose). If no
> one disagrees, I can make such a patch after jt/push-avoid-lazy-fetch is
> merged to master (as is the plan, according to What's Cooking [1]).

I think it's OK to continue leaving out QUICK there if it's not causing
problems. It really is a bit different than the other cases.

-Peff

Patch
diff mbox series

diff --git a/send-pack.c b/send-pack.c
index 6dc16c3211..34c77cbb1a 100644
--- a/send-pack.c
+++ b/send-pack.c
@@ -40,7 +40,8 @@  int option_parse_push_signed(const struct option *opt,
 
 static void feed_object(const struct object_id *oid, FILE *fh, int negative)
 {
-	if (negative && !has_object_file(oid))
+	if (negative &&
+	    !has_object_file_with_flags(oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
 		return;
 
 	if (negative)