Message ID | pull.1138.git.1643730593.gitgitgadget@gmail.com (mailing list archive)
---|---
Series | partial-clone: add ability to refetch with expanded filter
"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes: > If a filter is changed on a partial clone repository, for example from > blob:none to blob:limit=1m, there is currently no straightforward way to > bulk-refetch the objects that match the new filter for existing local > commits. This is because the client will report commits as "have" during > negotiation and any dependent objects won't be included in the transferred > pack. It sounds like a useful thing to have such a "refetch things" option. A lazy/partial clone is narrower than the full tree in the width dimension, while a shallow clone is shallower than the full history in the time dimension. The latter already has the "--deepen" support to say "the commits listed in my shallow boundary list may claim that I already have these, but I actually don't have them; please stop lying to the other side and refetch what I should have fetched earlier". I understand that this works in the other dimension to "--widen" things? Makes me wonder how well these two features work together (or if they are mutually exclusive, that is fine as well as a starting point). If you update the filter specification to make it narrower (e.g. you start from blob:limit=1m down to blob:limit=512k), would we transfer nothing (which would be ideal), or would we end up refetching everything that are smaller than 512k? > This patch series proposes adding a --refilter option to fetch & fetch-pack > to enable doing a full fetch with a different filter, as if the local has no > commits in common with the remote. It builds upon cbe566a071 > ("negotiator/noop: add noop fetch negotiator", 2020-08-18). I guess the answer to the last question is ... > To note: > > 1. This will produce duplicated objects between the existing and newly > fetched packs, but gc will clean them up. ... it is not smart enough to stell them to exclude what we _ought_ to have by telling them what the _old_ filter spec was. That's OK for a starting point, I guess. Hopefully, at the end of this operation, we should garbage collect the duplicated objects by default (with an option to turn it off)? > 2. This series doesn't check that there's a new filter in any way, whether > configured via config or passed via --filter=. Personally I think that's > fine. In other words, a repository that used to be a partial clone can become a full clone by using the option _and_ not giving any filter. I think that is an intuitive enough behaviour and a natural consequence to the extreme of what the feature is. Compared to making a full "git clone", fetching from the old local (and narrow) repository into it and then discarding the old one, it would not have any performance or storage advantage, but it probably is more convenient. > 3. If a user fetches with --refilter applying a more restrictive filter > than previously (eg: blob:limit=1m then blob:limit=1k) the eventual > state is a no-op, since any referenced object already in the local > repository is never removed. Potentially this could be improved in > future by more advanced gc, possibly along the lines discussed at [2]. OK. That matches my reaction to 1. above.
Hi Junio,

Thanks for your input. Hopefully some of the other partial-clone interested folks will chime in too.

On Tue, 1 Feb 2022 at 20:13, Junio C Hamano <gitster@pobox.com> wrote:
>
> It sounds like a useful thing to have such a "refetch things"
> option.

Any improved suggestions on the argument name? I thought of --refetch, but `fetch --refetch` seemed more confusing to explain.

> Makes me wonder how well these two features work together (or if
> they are mutually exclusive, that is fine as well as a starting
> point).

I don't see any particular reason they can't work together - as you say, the filtering is orthogonal to shallow on a conceptual level. I haven't added a test for that scenario yet, but will do for a v1.

> If you update the filter specification to make it narrower (e.g. you
> start from blob:limit=1m down to blob:limit=512k), would we transfer
> nothing (which would be ideal), or would we end up refetching
> everything that is smaller than 512k?

As you spot, the latter. I can't see a straightforward way of telling the server "I have these trees/blobs already" without generating (one way or the other) a list of millions of oids, then transferring & negotiating with it.

> ... it is not smart enough to tell them to exclude what we _ought_
> to have by telling them what the _old_ filter spec was. That's OK
> for a starting point, I guess.

The client doesn't really know what the local repository *has* - potentially several filters could have been applied and used for fetches at different points in the commit history, as well as objects dynamically fetched in. Even a filter set in the config only applies to subsequent fetches, and only if --filter isn't used to override it.

> Hopefully, at the end of this
> operation, we should garbage collect the duplicated objects by
> default (with an option to turn it off)?

I haven't explicitly looked into invoking gc yet, but yes, it'd be a bit of a waste if it wasn't kicked off by default. Maybe reusing gc.auto.

> In other words, a repository that used to be a partial clone can
> become a full clone by using the option _and_ not giving any filter.

For that specific case I think you can already do it by removing the promisor flag in the remote config, potentially adding it back if you wanted it to be partial again from that point forward.

> I think that is an intuitive enough behaviour and a natural
> consequence of taking the feature to its extreme. Compared to
> making a full "git clone", fetching from the old local (and narrow)
> repository into it and then discarding the old one, it would not
> have any performance or storage advantage, but it probably is more
> convenient.

It's certainly cleaner than abusing --deepen, or temporarily moving pack files out of the way, or starting over with a fresh clone & copying config.

Thanks,
Rob :)
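[The filter state referred to above lives in the remote's configuration; a quick way to inspect it. The config keys are real; the values shown in comments are examples:]

    # a partial clone records the promisor flag and the filter used
    git config remote.origin.promisor             # true
    git config remote.origin.partialclonefilter   # blob:none

    # a --filter on the command line overrides the configured value,
    # but only for that one fetch
    git fetch --filter=blob:limit=1m origin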
"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes: > If a filter is changed on a partial clone repository, for example from > blob:none to blob:limit=1m, there is currently no straightforward way to > bulk-refetch the objects that match the new filter for existing local > commits. This is because the client will report commits as "have" during > negotiation and any dependent objects won't be included in the transferred > pack. Another use case is discussed at [1]. Reporting commits as "have" can be solved by forcing the noop negotiator, but there is another issue (which you seem to have discovered, glancing through your patches) in that fetch will abort (and report success) if all wanted commits are present, even if not all objects directly or indirectly referenced by those commits are present. > This patch series proposes adding a --refilter option to fetch & fetch-pack > to enable doing a full fetch with a different filter, as if the local has no > commits in common with the remote. It builds upon cbe566a071 > ("negotiator/noop: add noop fetch negotiator", 2020-08-18). Thanks - I think this is a useful feature. This is useful even in a non-partial-clone repo, to repair objects that were, say, accidentally deleted from the local object store. If it's acceptable to have a separate command to configure the new filter in the repo config (or to delete it, if we want to convert a partial clone into a regular repo), I think it's clearer to name this option "--repair" or something like that, and explain it as a fetch that does not take into account the contents of the local object store (not as a fetch that changes the filter). Having said that, the overall feature is worth having. While we decide on the name, the implementation of this will not change much. I'll try to review the implementation by the end of this week (but other reviews are welcome too, needless to say). > To note: > > 1. This will produce duplicated objects between the existing and newly > fetched packs, but gc will clean them up. Noted. This might be an argument for naming the option "--repair", since the user would probably understand that there would be duplicate objects. > 2. This series doesn't check that there's a new filter in any way, whether > configured via config or passed via --filter=. Personally I think that's > fine. Agreed. > 3. If a user fetches with --refilter applying a more restrictive filter > than previously (eg: blob:limit=1m then blob:limit=1k) the eventual > state is a no-op, since any referenced object already in the local > repository is never removed. Potentially this could be improved in > future by more advanced gc, possibly along the lines discussed at [2]. That's true.
Hi Jonathan,

On Wed, 2 Feb 2022 at 19:00, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Thanks - I think this is a useful feature. This is useful even in a
> non-partial-clone repo, to repair objects that were, say, accidentally
> deleted from the local object store.

I'd

> If it's acceptable to have a separate command to configure the new
> filter in the repo config (or to delete it, if we want to convert a
> partial clone into a regular repo), I think it's clearer to name this
> option "--repair" or something like that, and explain it as a fetch that
> does not take into account the contents of the local object store (not
> as a fetch that changes the filter).

I quite like --repair, since the implementation really has zero to do with filtering or partial clones beyond that being my use case for it. Specifying a filter, shallow options, or even using a promisor remote aren't even slightly necessary for the implementation, as it turns out.

And as you say, that makes it easier to explain too:

"fetch --repair will fetch all objects from the remote (applying any filters or shallow options as requested). Unlike a normal fetch it does not take into account any content already in the local repository and acts more like an initial clone. Any duplicate objects will get cleaned up during subsequent maintenance."

"If you want to update your local repository with a different partial clone filter, use `fetch --repair` to re-download all matching objects from the remote."

> I'll try
> to review the implementation by the end of this week (but other reviews
> are welcome too, needless to say).

Thanks!
Rob :)
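[If the option landed under either name, the refilter workflow might look like this. Both flags are proposals in this thread, not released options:]

    # record the new, wider filter for future fetches
    git config remote.origin.partialclonefilter blob:limit=1m

    # re-download everything matching it, ignoring local contents
    git fetch --refilter origin    # or, per the proposal above: --repair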
On Wed, 2 Feb 2022 at 21:58, Robert Coup <robert@coup.net.nz> wrote:
>
> Hi Jonathan,
>
> On Wed, 2 Feb 2022 at 19:00, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > Thanks - I think this is a useful feature. This is useful even in a
> > non-partial-clone repo, to repair objects that were, say, accidentally
> > deleted from the local object store.
>
> I'd

I'd be keen to hear the stories here :) Just losing loose object files?

Rob.
On 2/1/22 10:49 AM, Robert Coup via GitGitGadget wrote:
> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack. Another use case is discussed at [1].
>
> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).
>
> To note:
>
> 1. This will produce duplicated objects between the existing and newly
>    fetched packs, but gc will clean them up.
> 2. This series doesn't check that there's a new filter in any way, whether
>    configured via config or passed via --filter=. Personally I think that's
>    fine.
> 3. If a user fetches with --refilter applying a more restrictive filter
>    than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>    state is a no-op, since any referenced object already in the local
>    repository is never removed. Potentially this could be improved in
>    future by more advanced gc, possibly along the lines discussed at [2].

Yes, it would be nice to have a way to efficiently extend a partial clone with a more inclusive filter.

It would be nice to be able to send the old filter-spec and the new filter-spec and ask the server to send "new && !old" to keep from having to resend the objects that the client already has. But I'm not sure we know enough (on either side) to make that computation. And as you say, there is no guarantee that the client has only used one filter in the past.

Jeff
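[A rough client-side approximation of the "new && !old" set described above, for illustration only. Enumerating objects like this is exactly the millions-of-oids problem mentioned earlier in the thread, and on a partial clone rev-list may need --missing=allow-any to avoid faulting in absent blobs:]

    # objects reachable under each filter-spec
    git rev-list --objects --all --missing=allow-any \
        --filter=blob:limit=1m | sort >new.txt
    git rev-list --objects --all --missing=allow-any \
        --filter=blob:none | sort >old.txt

    # objects wanted by the new filter but not the old one
    comm -23 new.txt old.txt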
Hi Derrick,

On Wed, 2 Feb 2022 at 15:02, Robert Coup <robert@coup.net.nz> wrote:
> On Tue, 1 Feb 2022 at 20:13, Junio C Hamano <gitster@pobox.com> wrote:
> > "Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > > 1. This will produce duplicated objects between the existing and newly
> > >    fetched packs, but gc will clean them up.
> >
> > Hopefully, at the end of this
> > operation, we should garbage collect the duplicated objects by
> > default (with an option to turn it off)?
>
> I haven't explicitly looked into invoking gc yet, but yes, it'd be a bit of
> a waste if it wasn't kicked off by default. Maybe reusing gc.auto.

Just looking into this: after a fetch, run_auto_maintenance() is called, which invokes `git maintenance run --auto` (being the successor to `git gc --auto`)...

In the refilter (repair) case, we'll have a new pack which substantially duplicates what's in our existing packs. I'm after some advice here: I'm not sure whether I want to encourage a gc pack consolidation, an incremental repack, both, neither, or something else.

My current train of thought is to invoke `git maintenance run --auto` with gc.autoPackLimit=1 to force a gc pack consolidation (unless it's currently =0, i.e. disabled), and with maintenance.incremental-repack.auto=-1 (unless it's currently =0, i.e. disabled) to force an incremental repack for those who don't want pack consolidations. In all cases the various enabled-ness settings should still be honoured, to skip tasks the user doesn't want happening automatically.

Implementation-wise, I guess we'll need run_auto_maintenance() to be able to pass some config options through to the maintenance subprocess.

Does this sound like a reasonable approach?

Thanks,
Rob :)
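[Expressed as a one-off command, the invocation proposed above might look like this. The config keys are real; forcing them per-invocation is the idea under discussion, not existing fetch behaviour:]

    # force a pack consolidation (gc task) unless it is disabled, and
    # mark the incremental-repack task as needed as a fallback
    git -c gc.autoPackLimit=1 \
        -c maintenance.incremental-repack.auto=-1 \
        maintenance run --auto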