[0/6,RFC] partial-clone: add ability to refetch with expanded filter

Message ID: pull.1138.git.1643730593.gitgitgadget@gmail.com

Message

Robert Coup via GitGitGadget Feb. 1, 2022, 3:49 p.m. UTC
If a filter is changed on a partial clone repository, for example from
blob:none to blob:limit=1m, there is currently no straightforward way to
bulk-refetch the objects that match the new filter for existing local
commits. This is because the client will report commits as "have" during
negotiation and any dependent objects won't be included in the transferred
pack. Another use case is discussed at [1].
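
For illustration, a minimal sketch of the limitation (repository URL and
sizes are placeholders):

    $ git clone --filter=blob:none https://example.com/repo.git
    $ cd repo
    $ git fetch --filter=blob:limit=1m origin
    # existing commits are advertised as "have" during negotiation, so blobs
    # reachable from them are still missing locally after this fetch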

This patch series proposes adding a --refilter option to fetch & fetch-pack
to enable doing a full fetch with a different filter, as if the local has no
commits in common with the remote. It builds upon cbe566a071
("negotiator/noop: add noop fetch negotiator", 2020-08-18).

To note:

 1. This will produce duplicated objects between the existing and newly
    fetched packs, but gc will clean them up.
 2. This series doesn't check that there's a new filter in any way, whether
    configured via config or passed via --filter=. Personally I think that's
    fine.
 3. If a user fetches with --refilter applying a more restrictive filter
    than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
    state is a no-op, since any referenced object already in the local
    repository is never removed. Potentially this could be improved in
    future by more advanced gc, possibly along the lines discussed at [2].

[1]
https://public-inbox.org/git/aa7b89ee-08aa-7943-6a00-28dcf344426e@syntevo.com/
[2]
https://public-inbox.org/git/A4BAD509-FA1F-49C3-87AF-CF4B73C559F1@gmail.com/

Robert Coup (6):
  fetch-negotiator: add specific noop initializer
  fetch-pack: add partial clone refiltering
  builtin/fetch-pack: add --refilter option
  fetch: add --refilter option
  t5616-partial-clone: add test for --refilter
  doc/partial-clone: mention --refilter option

 Documentation/fetch-options.txt           |  9 ++++
 Documentation/git-fetch-pack.txt          |  4 ++
 Documentation/technical/partial-clone.txt |  3 ++
 builtin/fetch-pack.c                      |  4 ++
 builtin/fetch.c                           | 18 ++++++-
 fetch-negotiator.c                        |  5 ++
 fetch-negotiator.h                        |  8 ++++
 fetch-pack.c                              | 57 +++++++++++++++--------
 fetch-pack.h                              |  1 +
 remote-curl.c                             |  6 +++
 t/t5616-partial-clone.sh                  | 42 ++++++++++++++++-
 transport-helper.c                        |  3 ++
 transport.c                               |  4 ++
 transport.h                               |  4 ++
 14 files changed, 146 insertions(+), 22 deletions(-)


base-commit: 5d01301f2b865aa8dba1654d3f447ce9d21db0b5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1138%2Frcoup%2Frc-partial-clone-refilter-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1138/rcoup/rc-partial-clone-refilter-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1138

Comments

Junio C Hamano Feb. 1, 2022, 8:13 p.m. UTC | #1
"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:

> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack.

It sounds like a useful thing to have such a "refetch things"
option.

A lazy/partial clone is narrower than the full tree in the width
dimension, while a shallow clone is shallower than the full history
in the time dimension.  The latter already has the "--deepen"
support to say "the commits listed in my shallow boundary list may
claim that I already have these, but I actually don't have them;
please stop lying to the other side and refetch what I should have
fetched earlier".  I understand that this works in the other
dimension to "--widen" things?
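
(For comparison, a sketch of the two dimensions; the --refilter spelling
assumes this series:)

    # shallow clone: extend history in the "time" dimension (existing option)
    $ git fetch --deepen=100 origin
    # partial clone: re-extend the object set in the "width" dimension (proposed here)
    $ git fetch --refilter --filter=blob:limit=1m origin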

Makes me wonder how well these two features work together (or if
they are mutually exclusive, that is fine as well as a starting
point).

If you update the filter specification to make it narrower (e.g. you
start from blob:limit=1m down to blob:limit=512k), would we transfer
nothing (which would be ideal), or would we end up refetching
everything that is smaller than 512k?

> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).

I guess the answer to the last question is ...

> To note:
>
>  1. This will produce duplicated objects between the existing and newly
>     fetched packs, but gc will clean them up.

... it is not smart enough to tell them to exclude what we _ought_
to have by telling them what the _old_ filter spec was.  That's OK
for a starting point, I guess.  Hopefully, at the end of this
operation, we should garbage collect the duplicated objects by
default (with an option to turn it off)?

>  2. This series doesn't check that there's a new filter in any way, whether
>     configured via config or passed via --filter=. Personally I think that's
>     fine.

In other words, a repository that used to be a partial clone can
become a full clone by using the option _and_ not giving any filter.
I think that is an intuitive enough behaviour and a natural
consequence to the extreme of what the feature is.  Compared to
making a full "git clone", fetching from the old local (and narrow)
repository into it and then discarding the old one, it would not
have any performance or storage advantage, but it probably is more
convenient.

>  3. If a user fetches with --refilter applying a more restrictive filter
>     than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>     state is a no-op, since any referenced object already in the local
>     repository is never removed. Potentially this could be improved in
>     future by more advanced gc, possibly along the lines discussed at [2].

OK.  That matches my reaction to 1. above.
Robert Coup Feb. 2, 2022, 3:02 p.m. UTC | #2
Hi Junio

Thanks for your input. Hopefully some of the other partial-clone
interested folks will chime in too.

On Tue, 1 Feb 2022 at 20:13, Junio C Hamano <gitster@pobox.com> wrote:
>
> It sounds like a useful thing to have such a "refetch things"
> option.

Any better suggestions for the option name? I thought of
--refetch, but `fetch --refetch` seemed more confusing to explain.

> Makes me wonder how well these two features work together (or if
> they are mutually exclusive, that is fine as well as a starting
> point).

I don't see any particular reason they can't work together - as you say,
the filtering is orthogonal to shallow on a conceptual level. I haven't
added a test for that scenario yet but will do for a v1.

> If you update the filter specification to make it narrower (e.g. you
> start from blob:limit=1m down to blob:limit=512k), would we transfer
> nothing (which would be ideal), or would we end up refetching
> everything that is smaller than 512k?

As you spot, the latter. I can't see a straightforward way of telling the
server "I have these trees/blobs already" without generating (one way
or the other) a list of millions of oids, then transferring & negotiating
with it.

> ... it is not smart enough to tell them to exclude what we _ought_
> to have by telling them what the _old_ filter spec was.  That's OK
> for a starting point, I guess.

The client doesn't really know what the local repository *has* —
potentially several filters could have been applied and used for fetches
at different points in the commit history, as well as objects dynamically
fetched in. Even a filter set in the config only applies to subsequent
fetches, and only if --filter isn't used to override it.

> Hopefully, at the end of this
> operation, we should garbage collect the duplicated objects by
> default (with an option to turn it off)?

I haven't explicitly looked into invoking gc yet, but yes, it'd be a bit of
a waste if it wasn't kicked off by default. Maybe reusing gc.auto

> In other words, a repository that used to be a partial clone can
> become a full clone by using the option _and_ not giving any filter.

For that specific case I think you can already do it by removing the
promisor flag in the remote config, potentially adding it back if you
wanted to keep it partial again from that point forward.
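
(A sketch of that existing workaround, assuming the remote is called
"origin":)

    $ git config --unset remote.origin.promisor
    $ git config --unset remote.origin.partialclonefilter
    # and `git config remote.origin.promisor true` later, to make it
    # behave as a partial clone again from that point forward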

> I think that is an intuitive enough behaviour and a natural
> consequence to the extreme of what the feature is.  Compared to
> making a full "git clone", fetching from the old local (and narrow)
> repository into it and then discarding the old one, it would not
> have any performance or storage advantage, but it probably is more
> convenient.

It's certainly cleaner than abusing --deepen, or temporarily moving pack
files out of the way, or starting over with a fresh clone & copying config.

Thanks,

Rob :)
Jonathan Tan Feb. 2, 2022, 6:59 p.m. UTC | #3
"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack. Another use case is discussed at [1].

Reporting commits as "have" can be solved by forcing the noop
negotiator, but there is another issue (which you seem to have
discovered, glancing through your patches) in that fetch will abort (and
report success) if all wanted commits are present, even if not all
objects directly or indirectly referenced by those commits are present.
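
(For reference, the noop negotiator can already be forced via configuration,
e.g. the sketch below; though, as noted, fetch may still exit early if all
wanted commits are already present locally:)

    $ git -c fetch.negotiationAlgorithm=noop fetch --filter=blob:limit=1m origin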

> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).

Thanks - I think this is a useful feature. This is useful even in a
non-partial-clone repo, to repair objects that were, say, accidentally
deleted from the local object store.

If it's acceptable to have a separate command to configure the new
filter in the repo config (or to delete it, if we want to convert a
partial clone into a regular repo), I think it's clearer to name this
option "--repair" or something like that, and explain it as a fetch that
does not take into account the contents of the local object store (not
as a fetch that changes the filter).
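
(A sketch of that alternative flow; "--repair" is only the name suggested
above, and "origin" plus the filter value are placeholders:)

    $ git config remote.origin.partialclonefilter blob:limit=1m
    $ git fetch --repair origin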

Having said that, the overall feature is worth having. While we decide
on the name, the implementation of this will not change much. I'll try
to review the implementation by the end of this week (but other reviews
are welcome too, needless to say).

> To note:
> 
>  1. This will produce duplicated objects between the existing and newly
>     fetched packs, but gc will clean them up.

Noted. This might be an argument for naming the option "--repair", since
the user would probably understand that there would be duplicate
objects.

>  2. This series doesn't check that there's a new filter in any way, whether
>     configured via config or passed via --filter=. Personally I think that's
>     fine.

Agreed.

>  3. If a user fetches with --refilter applying a more restrictive filter
>     than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>     state is a no-op, since any referenced object already in the local
>     repository is never removed. Potentially this could be improved in
>     future by more advanced gc, possibly along the lines discussed at [2].

That's true.
Robert Coup Feb. 2, 2022, 9:58 p.m. UTC | #4
Hi Jonathan,

On Wed, 2 Feb 2022 at 19:00, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Thanks - I think this is a useful feature. This is useful even in a
> non-partial-clone repo, to repair objects that were, say, accidentally
> deleted from the local object store.

I'd

> If it's acceptable to have a separate command to configure the new
> filter in the repo config (or to delete it, if we want to convert a
> partial clone into a regular repo), I think it's clearer to name this
> option "--repair" or something like that, and explain it as a fetch that
> does not take into account the contents of the local object store (not
> as a fetch that changes the filter).

I quite like --repair, since the implementation really has zero to do with
filtering or partial clones beyond that being my use case for it. Specifying
a filter, shallow options, or even using a promisor remote aren't
even slightly necessary for the implementation as it turns out.

And as you say, that makes it easier to explain too:

"fetch --repair will fetch all objects from the remote (applying any filters
or shallow options as requested). Unlike a normal fetch it does not take into
account any content already in the local repository and acts more like an
initial clone. Any duplicate objects will get cleaned up during subsequent
maintenance."

"If you want to update your local repository with a different partial clone
filter, use `fetch --repair` to re-download all matching objects from the
remote."

> I'll try
> to review the implementation by the end of this week (but other reviews
> are welcome too, needless to say).

Thanks!

Rob :)
Robert Coup Feb. 2, 2022, 9:59 p.m. UTC | #5
On Wed, 2 Feb 2022 at 21:58, Robert Coup <robert@coup.net.nz> wrote:
>
> Hi Jonathan,
>
> On Wed, 2 Feb 2022 at 19:00, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > Thanks - I think this is a useful feature. This is useful even in a
> > non-partial-clone repo, to repair objects that were, say, accidentally
> > deleted from the local object store.
>
> I'd

I'd be keen to hear the stories here :) Just losing loose object files?

Rob.
Jeff Hostetler Feb. 7, 2022, 7:37 p.m. UTC | #6
On 2/1/22 10:49 AM, Robert Coup via GitGitGadget wrote:
> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack. Another use case is discussed at [1].
> 
> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).
> 
> To note:
> 
>   1. This will produce duplicated objects between the existing and newly
>      fetched packs, but gc will clean them up.
>   2. This series doesn't check that there's a new filter in any way, whether
>      configured via config or passed via --filter=. Personally I think that's
>      fine.
>   3. If a user fetches with --refilter applying a more restrictive filter
>      than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>      state is a no-op, since any referenced object already in the local
>      repository is never removed. Potentially this could be improved in
>      future by more advanced gc, possibly along the lines discussed at [2].
> 

Yes, it would be nice to have a way to efficiently extend a
partial clone with a more inclusive filter.

It would be nice to be able to send the old filter-spec and the
new filter-spec and ask the server to send "new && !old" to keep
from having to resend the objects that the client already has.
But I'm not sure we know enough (on either side) to make that
computation.  And as you say, there is no guarantee that the client
has only used one filter in the past.

Jeff
Robert Coup Feb. 16, 2022, 1:24 p.m. UTC | #7
Hi Derrick,

On Wed, 2 Feb 2022 at 15:02, Robert Coup <robert@coup.net.nz> wrote:
> On Tue, 1 Feb 2022 at 20:13, Junio C Hamano <gitster@pobox.com> wrote:
> > "Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > >  1. This will produce duplicated objects between the existing and newly
> > >     fetched packs, but gc will clean them up.
> >
> > Hopefully, at the end of this
> > operation, we should garbage collect the duplicated objects by
> > default (with an option to turn it off)?
>
> I haven't explicitly looked into invoking gc yet, but yes, it'd be a bit of
> a waste if it wasn't kicked off by default. Maybe reusing gc.auto

Just looking into this: after a fetch, run_auto_maintenance() is called, which
invokes `git maintenance run --auto` (being the successor to `git gc --auto`)...

In the refilter (repair) case, we'll have a new pack which substantially
duplicates what's in our existing packs. I'm after some advice here: I'm not
sure whether I want to encourage a gc pack consolidation, an incremental
repack, both, neither, or something else?

My current train of thought is that it invokes `git maintenance run --auto`
with gc.autoPackLimit=1 (unless that is currently =0, i.e. disabled) to force
a gc pack consolidation, and with maintenance.incremental-repack.auto=-1
(again, unless currently =0/disabled) to force an incremental repack for
anyone who doesn't want pack consolidations.
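
For concreteness, the effect would be roughly as if the fetch ran something
like this afterwards (a sketch of the idea, not a final implementation):

    $ git -c gc.autoPackLimit=1 \
          -c maintenance.incremental-repack.auto=-1 \
          maintenance run --auto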

In all cases the various enabled-ness settings should still be honoured to
skip tasks if the user doesn't want them ever happening automatically.

Implementation-wise I guess we'll need run_auto_maintenance() to be able
to pass some config options through to the maintenance subprocess.

Does this sound like a reasonable approach?

Thanks,

Rob :)