Message ID | pull.1138.v3.git.1646406274.gitgitgadget@gmail.com (mailing list archive)
Series | fetch: add repair: full refetch without negotiation (was: "refiltering")
"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:

> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> fetch negotiation and any dependent objects won't be included in the
> transferred pack. Another use case is discussed at [1].
>
> This patch series introduces a --refetch option to fetch & fetch-pack to
> enable doing a full fetch without performing any commit negotiation with the
> remote, as a fresh clone does. It builds upon cbe566a071 ("negotiator/noop:
> add noop fetch negotiator", 2020-08-18).

Hi Robert,

This is my first time sending a review to the list, so forgive me for any
mistakes I make or conventions I've missed. Feedback about my review would be
much appreciated!

Overall I think your patch is well written and the implementation accomplishes
what you describe in your cover letter. However, I would like to discuss
another possible design I thought of. Currently, the user has to know to run
`--refetch` after changing the partial clone filter configuration in order to
fetch the objects that match the new filter. Ideally, I believe this behavior
should be the default: instead of adding an option, if Git knew which filter
was last used for a fetch, it could automatically "refetch" everything whenever
the last used filter differs from the configured filter. I'm not sure the
config is the best place to store the last used filter, but we can use it as an
example for now. The tradeoff here is between carrying an additional config
variable and requiring the user to know to pass an extra option to fetch after
changing the config.
And in the future, if there are other use cases for needing a fetch that
bypasses commit negotiation (I know you described one such possible use case in
v2), the `--refetch` option can easily be re-added, hooking into this patch.
Looking forward to hearing your thoughts!
Hi Calvin,

On Wed, 9 Mar 2022 at 00:27, Calvin Wan <calvinwan@google.com> wrote:
>
> This is my first time sending a review to the list, so forgive me for any
> mistakes I make or conventions missed. Feedback about my review would be well
> appreciated!

I'm as new as you to contributing to Git :-)

> Overall I think your patch is well written and the implementation accomplishes
> what you describe in your cover letter,

Thanks! I think if you're completely content you can add your Reviewed-by as
described at https://git-scm.com/docs/SubmittingPatches#commit-trailers

> I would like to discuss another
> possible design I thought of. Currently, the user has to know to run '--refetch'
> after changing the partial clone filter configuration in order to fetch the
> commits that match the new filter. Ideally I believe this behavior should be
> default so therefore instead of adding an option, if git knew what filter was
> last used in the fetch, it could automatically 'refetch' everything if there is
> a change between the last used filter and the default filter.

So, if you do a partial clone using `git clone --filter=...` then the filter is
saved into the config at `remote.<name>.partialclonefilter` and is re-used by
default for subsequent fetches from that remote. But there's nothing to stop
`git fetch --filter=...` being run multiple times with different filters to
carefully set up a repository for a particular use case, and there is no
notion that there has to be "one" filter in place for a remote.

Running `git fetch --filter=...` doesn't update the remote's partial clone
filter in the config, and IMO it shouldn't, for the above reason.
I think having `git config remote.<name>.partialclonefilter <new-filter>` print
something to the user along the lines of "Your change to/removal of the filter
won't fetch additional objects associated with existing commits; you can do
that with `fetch --refetch <remote>`" could be helpful, but after a very quick
look I can't see anything like that at the moment for other config settings
(i.e. no plumbing in place to easily reuse), and I'm not motivated to add such
plumbing.

Cheers,
Rob :)
Robert Coup <robert@coup.net.nz> writes:

> So, if you do a partial clone using `git clone --filter=...` then the
> filter is saved into the config at `remote.<name>.partialclonefilter`
> and is re-used by default for subsequent fetches from that remote. But
> there's nothing to stop `git fetch --filter=...` being run multiple
> times with different filters to carefully set up a repository for a
> particular use case, or any notion that there has to be "one" filter
> in place for a remote.

The way I read Calvin's suggestion was that you won't allow such a random
series of "git fetch"es without updating the "this is the filter that is
consistent with the contents of this repository" record, which would otherwise
lead to inconsistencies. I.e.

 - we must maintain the "filter that is consistent with the contents
   of this repository", which this series does not do, but we should.

 - the "--refetch" option is unnecessary and redundant, as long as such a
   record is maintained; when the filter setting changes, we should do the
   equivalent of "--refetch" automatically.

IOW, ...

> Running `git fetch --filter=...` doesn't update the remote's partial
> clone filter in the config, and IMO it shouldn't for the above reason.

... isn't a "git fetch --filter" that does not update the configured filter
(and does not do a refetch automatically) the bug that made "--refetch"
necessary in the first place?

Or perhaps I read Calvin incorrectly?
On 3/9/2022 1:32 PM, Junio C Hamano wrote:

> - we must maintain the "filter that is consistent with the contents
>   of this repository", which this series does not do, but we should.

If this is possible/reasonably implementable, then...

> - the "--refetch" is unnecessary and redundant, as long as such a
>   record is maintained; when a filter settings changes, we should
>   do the equivalent of "--refetch" automatically.

... this is the logical conclusion that can be drawn.

> isn't "git fetch --filter" that does not update the configured
> filter (and does not do a refetch automatically) a bug that made the
> "refetch" necessary in the first place?

There are two cases to consider here:

1. The user changes the filter config using "git config".
2. The user runs "git fetch --filter".

The first case is the use case Robert advocates for "--refetch". I originally
suggested saving the last used filter into the config; if the last used filter
differs from the configured filter, then fetch automatically "refetches"
everything.

The second case is the one Junio believes I am referring to; whether to
classify this as a bug or a feature, I am uncertain. As above, I suggested
saving it as the last used filter, to be compared against the configured
filter. Whether to also promote it to the configured filter can be a separate
discussion, because I can see pros and cons for that.

Ultimately the expectation is that if I run "git fetch" without --filter, then
I fetch based on the configured filter. And if I have previously run "git
fetch" with the same filter, whether from the config or given explicitly, then
Git will only fetch the latest refs/objects.
Hi,

On Wed, 9 Mar 2022 at 21:32, Junio C Hamano <gitster@pobox.com> wrote:
>
> The way I read Calvin's suggestion was that you won't allow such a
> random series of "git fetch"es without updating the "this is the
> filter that is consistent with the contents of this repository"
> record, which will lead to inconsistencies. I.e.
>
> - we must maintain the "filter that is consistent with the contents
>   of this repository", which this series does not do, but we should.

I don't think we should strive to keep this "consistency" —

> - the "--refetch" is unnecessary and redundant, as long as such a
>   record is maintained; when a filter settings changes, we should
>   do the equivalent of "--refetch" automatically.

— we don't know how much data has been pulled in by fetches from different
promisor and non-promisor remotes (past & present), or dynamically faulted in
through branch switching or history exploration. And I can't see any particular
benefit in attempting to keep track of that.

Ævar suggested that in future we could maybe figure out which commits a user
definitively has all the blobs & trees for, and refetch could negotiate from
that position to improve efficiency: nothing in this series precludes such an
enhancement.

> ... isn't "git fetch --filter" that does not update the configured
> filter (and does not do a refetch automatically) a bug that made the
> "refetch" necessary in the first place?

I don't believe it's a bug.
A fairly obvious partial clone example I've used before on repos where I want
the commit history but not all the associated data (especially when the history
is littered with giant blobs I don't care about):

    git clone example.com/myrepo --filter=blob:none
    # does a partial clone with no blobs
    # checkout faults in the blobs present at HEAD in bulk to populate
    # the working tree

    git config --unset remote.origin.partialclonefilter
    # going forward, future fetches include all associated blobs for new commits

Getting all the blobs for all history is something I'm explicitly trying not to
do in this example, but if the next fetch from origin automatically did a
"refetch" after I removed the filter, that's exactly what would happen.

We don't expect users to update `diff.algorithm` in config to run a minimal
diff: using the `--diff-algorithm=` option on the command line overrides the
config. The same philosophy applies to fetch:
`remote.<name>.partialclonefilter` provides the default filter for fetches, and
a user can override it via `git fetch --filter=`. To me this is how Git
commands are expected to work.

Partial clones are still relatively new and advanced, and I don't believe we
should try to over-predict what the correct behaviour is for a user.

I'd be happy adding something to the documentation for the
`remote.<name>.partialclonefilter` config setting to explain that changing or
removing the filter won't backfill the local object DB, and that the user would
need `fetch --refetch` for that.

Thanks,
Rob :)
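The config-vs-command-line precedence described above can be checked directly.
This is an illustrative sketch against a throwaway local repository (the paths
are invented for the example): `git clone --filter=...` records the filter in
`remote.origin.partialclonefilter`, and a later `git fetch --filter=...` with a
different filter applies only to that one fetch, leaving the recorded filter
untouched:

```shell
tmp=$(mktemp -d)

# A minimal "server" repo that permits filtered fetches.
git init -q -b main "$tmp/server"
git -C "$tmp/server" config uploadpack.allowFilter true
echo hi > "$tmp/server/f.txt"
git -C "$tmp/server" add f.txt
git -C "$tmp/server" -c user.name=t -c user.email=t@example.com commit -qm init

# Partial clone: the clone-time filter is saved to the remote's config.
git clone -q --no-local --filter=blob:none "file://$tmp/server" "$tmp/client"

# A one-off fetch with a different filter does not rewrite the config.
git -C "$tmp/client" fetch -q --filter=tree:0 origin
saved=$(git -C "$tmp/client" config remote.origin.partialclonefilter)

echo "configured filter: $saved"
```

The saved value is still `blob:none` after the `tree:0` fetch, which is the
behaviour Robert argues should stay as-is.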
Hi Robert,

Documentation for the config setting is an acceptable solution! Apologies for
the late response -- I wanted to wait and see if anyone else on the list had
any last thoughts. Also, I noticed you were hoping that Jonathan Tan could take
a look at your patch on the What's Cooking thread. Before I sent my first
review out, I discussed your patch with him, so he's been briefed.

Reviewed-by: Calvin Wan <calvinwan@google.com>

On Thu, Mar 10, 2022 at 6:29 AM Robert Coup <robert@coup.net.nz> wrote:
> [...]
Hi Calvin,

On Mon, 21 Mar 2022 at 17:58, Calvin Wan <calvinwan@google.com> wrote:

> Documentation for the config setting is an acceptable solution!

Great :)

> Apologies for the late response -- wanted to wait and see if anyone
> else on the list had any last thoughts. Also I noticed you were hoping
> that Jonathan Tan could take a look at your patch on the What's
> Cooking thread. Before I sent my first review out, I discussed your
> patch with him so he's been briefed.
>
> Reviewed-by: Calvin Wan <calvinwan@google.com>

Thank you.

@Junio - I'm on leave for the remainder of this week, so expect the re-roll
sometime next week.

Rob :)