[RFC,0/5] Remove git-filter-branch from git.git; host it elsewhere

Message ID 20190826235226.15386-1-newren@gmail.com

Message

Elijah Newren Aug. 26, 2019, 11:52 p.m. UTC
Following up on the suggestion to make git.git smaller and shed non-core
tools, here's an RFC series to do so with git-filter-branch.  This
series first removes dependencies on git-filter-branch (of which there
were very few), and then deletes git-filter-branch itself in the final
commit.

I'm more than happy to consider alternate places for the filter-branch
history (I had considered just merging it in with git-filter-repo), but
for now I just made it available here:
        https://github.com/newren/git-filter-branch

The rewrite above contains the history of the files deleted in Patch 5,
plus a one-time copy of relevant build files (Makefiles, test-lib.sh,
etc. -- I didn't want the whole history of these), and then touchups to
streamline the build files and make them all work in this standalone
repo.


Some high-level notes on the patches:

  * Patches 1&2: good cleanups & performance wins regardless of
    whether the rest of the series is taken
    
  * Patch 3: an attempt to improve the i18n situation for external
    scripts, which turned out not to be necessary/useful for
    git-filter-branch specifically

  * Patch 4:
    * If we are good with deleting git-filter-branch now and just noting
      it in the release notes, then patch 4 could be simplified; there's
      no need to update git-filter-branch.txt in that case.
    * If, however, we want to do some external messaging for an
      additional release cycle or two before moving git-filter-branch
      out of git.git, this patch will help us until then to at least
      avoid recommending a tool which will likely mangle users' data in
      unexpected ways.  But it'd be really helpful if folks could review
      and opine on the BFG stuff if so.

  * Patch 5: actually deletes git-filter-branch, its tests, and
    documentation.


Elijah Newren (5):
  t6006: simplify and optimize empty message test
  t3427: accelerate this test by using fast-export and fast-import
  git-sh-i18n: work with external scripts
  Recommend git-filter-repo instead of git-filter-branch in
    documentation
  Remove git-filter-branch, it is now external to git.git

 .gitignore                          |   1 -
 Documentation/git-fast-export.txt   |   6 +-
 Documentation/git-filter-branch.txt | 481 --------------------
 Documentation/git-gc.txt            |  17 +-
 Documentation/git-rebase.txt        |   2 +-
 Documentation/git-replace.txt       |  10 +-
 Documentation/git-svn.txt           |   4 +-
 Documentation/githooks.txt          |   7 +-
 Makefile                            |   1 -
 command-list.txt                    |   1 -
 contrib/svn-fe/svn-fe.txt           |   4 +-
 git-filter-branch.sh                | 662 ----------------------------
 git-sh-i18n.sh                      |   7 +-
 t/perf/p7000-filter-branch.sh       |  24 -
 t/t3427-rebase-subtree.sh           |  32 +-
 t/t6006-rev-list-format.sh          |   5 +-
 t/t7003-filter-branch.sh            | 505 ---------------------
 t/t7009-filter-branch-null-sha1.sh  |  55 ---
 t/t9902-completion.sh               |  12 +-
 19 files changed, 63 insertions(+), 1773 deletions(-)
 delete mode 100644 Documentation/git-filter-branch.txt
 delete mode 100755 git-filter-branch.sh
 delete mode 100755 t/perf/p7000-filter-branch.sh
 delete mode 100755 t/t7003-filter-branch.sh
 delete mode 100755 t/t7009-filter-branch-null-sha1.sh

Comments

Derrick Stolee Aug. 27, 2019, 1:39 a.m. UTC | #1
On 8/26/2019 7:52 PM, Elijah Newren wrote:
> Following up on the suggestion to make git.git smaller and shed non-core
> tools, here's an RFC series to do so with git-filter-branch.  This
> series first removes dependencies on git-filter-branch (of which there
> were very few), and then deletes git-filter-branch itself in the final
> commit.
> 
> I'm more than happy to consider alternate places for the filter-branch
> history (I had considered just merging it in with git-filter-repo), but
> for now I just made it available here:
>         https://github.com/newren/git-filter-branch
> 
> The rewrite above contains the history of the files deleted in Patch 5,
> plus a one-time copy of relevant build files (Makefiles, test-lib.sh,
> etc. -- I didn't want the whole history of these), and then touchups to
> streamline the build files and make them all work in this standalone
> repo.
> 
> 
> Some high-level notes on the patches:
> 
>   * Patches 1&2: are good cleanups & performance wins regardless of
>     whether the rest of the series is taken

I agree! These are great. I just had a nit about extracting a helper
instead of copy-pasting the same three lines in multiple tests.

>   * Patch 3: an attempt to improve i18n situation for external scripts,
>     but discovered to not be necessary/useful for git-filter-branch
>     specifically

I'm not sure this is super-important now, but it could be saved for a
later date, when it is important.

>   * Patch 4:
>     * If we are good with deleting git-filter-branch now and just noting
>       it in the release notes, then patch 4 could be simplified; there's
>       no need to update git-filter-branch.txt in that case.
>     * If, however, we want to do some external messaging for an
>       additional release cycle or two before moving git-filter-branch
>       out of git.git, this patch will help us until then to at least
>       avoid recommending a tool which will likely mangle user's data in
>       unexpected ways.  But it'd be really helpful if folks could review
>       and opine on the BFG stuff if so.

I think this is a good step, and should be taken even if we never
plan to take Patch 5.

>   * Patch 5: actually deletes git-filter-branch, its tests, and
>     documentation.

This is the one where others need to chime in with opinions. I
think this one can only be taken if we have a concrete plan about
how to support the tool _somehow_, even if it is "go download the
script from this place; it may have broken since we last tested it."

Yes, we want to strongly recommend that people use newer, better
tools. That's not always something users can accept. Having the
tool live somewhere that is accessible can appease some users for
a while, and it can decay and die a slow death there.

Thanks,
-Stolee
Elijah Newren Aug. 27, 2019, 6:17 a.m. UTC | #2
On Mon, Aug 26, 2019 at 6:39 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 8/26/2019 7:52 PM, Elijah Newren wrote:
> > Following up on the suggestion to make git.git smaller and shed non-core
> > tools, here's an RFC series to do so with git-filter-branch.  This
> > series first removes dependencies on git-filter-branch (of which there
> > were very few), and then deletes git-filter-branch itself in the final
> > commit.
> >
> > I'm more than happy to consider alternate places for the filter-branch
> > history (I had considered just merging it in with git-filter-repo), but
> > for now I just made it available here:
> >         https://github.com/newren/git-filter-branch
> >
> > The rewrite above contains the history of the files deleted in Patch 5,
> > plus a one-time copy of relevant build files (Makefiles, test-lib.sh,
> > etc. -- I didn't want the whole history of these), and then touchups to
> > streamline the build files and make them all work in this standalone
> > repo.
> >
> >
> > Some high-level notes on the patches:
> >
> >   * Patches 1&2: are good cleanups & performance wins regardless of
> >     whether the rest of the series is taken
>
> I agree! These are great. I just had a nit about extracting a helper
> instead of copy-pasting the same three lines in multiple tests.
>
> >   * Patch 3: an attempt to improve i18n situation for external scripts,
> >     but discovered to not be necessary/useful for git-filter-branch
> >     specifically
>
> I'm not sure this is super-important now, but could be saved for a
> later date, when it is important.
>
> >   * Patch 4:
> >     * If we are good with deleting git-filter-branch now and just noting
> >       it in the release notes, then patch 4 could be simplified; there's
> >       no need to update git-filter-branch.txt in that case.
> >     * If, however, we want to do some external messaging for an
> >       additional release cycle or two before moving git-filter-branch
> >       out of git.git, this patch will help us until then to at least
> >       avoid recommending a tool which will likely mangle user's data in
> >       unexpected ways.  But it'd be really helpful if folks could review
> >       and opine on the BFG stuff if so.
>
> I think this is a good step, and should be taken even if we never
> plan to take Patch 5.
>
> >   * Patch 5: actually deletes git-filter-branch, its tests, and
> >     documentation.
>
> This is the one where others need to chime in with opinions. I
> think this one can only be taken if we have a concrete plan about
> how to support the tool _somehow_, even if it is "go download the
> script from this place; it may have broken since we last tested it."
>
> Yes, we want to strongly recommend that people use newer, better
> tools. That's not always something users can accept. Having the
> tool live somewhere that is accessible can appease some users for
> a while, and it can decay and die a slow death there.

Perhaps I should add some more words about the separate repo I
created; even though it wasn't one of the five patches in this series
it actually represented the lion's share of the work before I submitted
this.  Anyway, it has a Makefile which supports the normal 'test',
'doc', 'clean', 'install' (and variants), and 'dist' (and variants).
The 'test' target will run all the filter-branch tests taken from the
git.git testsuite (i.e. t7003 and t7009) without requiring a version
of git built inside that separate repo, 'doc' will build both html and
manpages, etc.  It doesn't look at any config.mak* files and has
stripped out lots of stuff from the main repo, but it's relatively
minimal and self-contained beyond an assumption that a normal copy of
git has been installed somehow already.  Given how infrequently
filter-branch has needed fixes in the past, and the fact that it is
pretty good about relying on plumbing rather than porcelain, I suspect
it might actually be a pretty light maintenance load to keep it
running for a good long time.
Eric Wong Aug. 27, 2019, 7:03 a.m. UTC | #3
Elijah Newren <newren@gmail.com> wrote:
> Some high-level notes on the patches:
> 
>   * Patches 1&2: are good cleanups & performance wins regardless of
>     whether the rest of the series is taken

Agreed.  Though weren't we moving away from pipes in tests
because failures could go unnoticed?  (I haven't been paying too
much attention, though)

>   * Patch 5: actually deletes git-filter-branch, its tests, and
>     documentation.

Given how long we've had git-filter-branch, I suggest we keep it
around but add a warning at runtime notifying users of its
impending removal.  Such a warning should remain for >=10 years or
whatever a distro support cycle is, nowadays.

And there should probably be a way to disable the warning via
git-config if it's too annoying.
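
A minimal sketch of what such an opt-out warning could look like inside a shell script such as git-filter-branch.sh. The config key `filterBranch.showRemovalWarning` and the function name are hypothetical, invented here for illustration; nothing in this thread defines them.

```shell
# Hypothetical helper: print a removal warning on stderr unless the user
# has opted out. The config key is an invented name, not an existing git
# setting.
warn_impending_removal () {
	# `git config --bool --get` prints "true" or "false" for a key that
	# is set and fails when the key is unset; treat "unset" as "show it".
	show=$(git config --bool --get filterBranch.showRemovalWarning 2>/dev/null) ||
		show=true
	if test "$show" != false
	then
		echo "warning: git-filter-branch is scheduled for removal from git.git" >&2
		echo "hint: silence this with 'git config filterBranch.showRemovalWarning false'" >&2
	fi
}
```

Calling `warn_impending_removal` near the top of the script would warn on every run until the user sets the key to false.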

AFAIK, filter-branch is not causing support headaches for any
git developers today.  With so many commands in git, it's
unlikely newbies will ever get around to discover it :)
So I don't think we should be in any rush to remove it.

But I agree that filter-branch isn't useful and certainly
shouldn't be encouraged/promoted.

Yet there are probably still users who ARE happy with it, who
will never hit the edge cases and problems it poses; and will
never read release notes.  And said users are probably getting
git from a slow-moving distro, so it'd be a disservice to them
if they lost a tool they depend on without any warning.
Sergey Organov Aug. 27, 2019, 8:43 a.m. UTC | #4
Eric Wong <e@80x24.org> writes:


[...]

> AFAIK, filter-branch is not causing support headaches for any
> git developers today.  With so many commands in git, it's
> unlikely newbies will ever get around to discover it :)
> So I don't think we should be in any rush to remove it.

Nah, discovering it is simple. Just Google for "git change author". That
eventually leads to a script that uses "git filter-branch --env-filter"
to get the job done, and I'm afraid it is spread all over the world.

See, e.g.:

https://help.github.com/en/articles/changing-author-info
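
For readers who have not seen it, the kind of script being referred to follows roughly the pattern below. This is a sketch of the general `--env-filter` approach, not a copy of the linked article; the function name and the `OLD_EMAIL`/`NEW_NAME`/`NEW_EMAIL` variables are placeholders chosen for this example.

```shell
# Rewrite author and committer identity for commits whose email matches.
# Usage: filter_branch_reauthor <old-email> <new-name> <new-email> <rev-list args...>
filter_branch_reauthor () {
	old_email=$1 new_name=$2 new_email=$3
	shift 3
	# The three variables are placed in filter-branch's environment so
	# that the filter snippet, which runs once per commit, can read them.
	OLD_EMAIL="$old_email" NEW_NAME="$new_name" NEW_EMAIL="$new_email" \
	git filter-branch --env-filter '
		if test "$GIT_AUTHOR_EMAIL" = "$OLD_EMAIL"
		then
			export GIT_AUTHOR_NAME="$NEW_NAME"
			export GIT_AUTHOR_EMAIL="$NEW_EMAIL"
		fi
		if test "$GIT_COMMITTER_EMAIL" = "$OLD_EMAIL"
		then
			export GIT_COMMITTER_NAME="$NEW_NAME"
			export GIT_COMMITTER_EMAIL="$NEW_EMAIL"
		fi
	' -- "$@"
}
```

For example, `filter_branch_reauthor old@example.com "New Name" new@example.com --branches` would cover all local branches, while a range such as `origin/master..HEAD` limits the rewrite to the current branch's own commits.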

> But I agree that filter-branch isn't useful and certainly
> shouldn't be encouraged/promoted.

Well, is there a more suitable way to change the author for a (large)
set of commits then?

> Yet there's probably still users which ARE happy with it, that
> will never hit the edge cases and problems it poses; and will
> never read release notes.  And said users are probably getting
> git from a slow-moving distro, so it'd be a disservice to them
> if they lost a tool they depend on without any warning.

Personally, I'm far from happy with it, but I have no clue how to
substitute it in the job above. Anybody?

-- Sergey
Elijah Newren Aug. 27, 2019, 10:18 p.m. UTC | #5
On Tue, Aug 27, 2019 at 1:43 AM Sergey Organov <sorganov@gmail.com> wrote:
>
> Eric Wong <e@80x24.org> writes:
>
>
> [...]
>
> > AFAIK, filter-branch is not causing support headaches for any
> > git developers today.  With so many commands in git, it's
> > unlikely newbies will ever get around to discover it :)
> > So I don't think we should be in any rush to remove it.
>
> Nah, discovering it is simple. Just Google for "git change author". That
> eventually leads to a script that uses "git filter-branch --env-filter"
> to get the job done, and I'm afraid it is spread all over the world.
>
> See, e.g.:
>
> https://help.github.com/en/articles/changing-author-info

Side note: Is the goal to "fix names and email addresses in this
repository"?  If so, this guide fails: it doesn't update tagger names
or email addresses.  Indeed, filter-branch doesn't provide a way to do
that.  (Not to mention other problems like not updating references to
commit hashes in commit messages when it is busy rewriting everything.)

> > But I agree that filter-branch isn't useful and certainly
> > shouldn't be encouraged/promoted.
>
> Well, is there more suitable way to change author for a (large) set of
> commits then?

I would say yes, use git filter-repo (note that this thread started
with me proposing filter-repo for inclusion in git.git -- and getting
suggestions that we should remove stuff instead of adding more stuff).
I'm biased, but I think it's much better at this particular job as
well:


You can create a mailmap file and pass it to the --mailmap option to
git-filter-repo.

Or, if you prefer (perhaps you don't like git's mailmap format as used
by shortlog and now log, or perhaps you really want to be able to do
regex replacement or something), you can use the --name-callback or
--email-callback to work on those fields more directly.

Or, if you prefer (e.g. you want to handle author vs. committer vs.
tagger differently), you can use the --commit-callback and
--tag-callback filters.
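
As a rough illustration of the first option, a mailmap-based rewrite might look like the sketch below. The file name `my-mailmap`, the function name, and all names and addresses are placeholders; the sketch also assumes git-filter-repo is installed and that it is run in a fresh clone.

```shell
# Write a mailmap that maps the old (commit-time) address to a new identity.
# Line format: "Proper Name <proper@email> <commit@email>".
cat >my-mailmap <<'EOF'
Public Name <public@example.com> <internal@example.com>
EOF

# The mapping can be previewed with stock git before rewriting anything:
#   git -c mailmap.file="$PWD/my-mailmap" check-mailmap \
#       "Old Name <internal@example.com>"

# Rewrite history using the mailmap (assumes git-filter-repo is on PATH).
rewrite_with_mailmap () {
	git filter-repo --mailmap my-mailmap
}
```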


As an added bonus, filter-repo will also perform the rewrite far
faster than filter-branch (and rewrite commit hashes in commit
messages as alluded to above).

> > Yet there's probably still users which ARE happy with it, that
> > will never hit the edge cases and problems it poses; and will
> > never read release notes.  And said users are probably getting
> > git from a slow-moving distro, so it'd be a disservice to them
> > if they lost a tool they depend on without any warning.
>
> Personally, I'm far from happy with it, but I have no clue how to
> substitute it in the job above. Anybody?

The start of this thread where I proposed git filter-repo for
inclusion in git[1] had links to documentation and comparisons to
other tools and such.  You may find those links helpful; if not, let
me know what needs to be fixed in the documentation.

Elijah

[1] https://public-inbox.org/git/CABPp-BEr8LVM+yWTbi76hAq7Moe1hyp2xqxXfgVV4_teh_9skA@mail.gmail.com/
Sergey Organov Aug. 28, 2019, 8:52 a.m. UTC | #6
Elijah Newren <newren@gmail.com> writes:

> On Tue, Aug 27, 2019 at 1:43 AM Sergey Organov <sorganov@gmail.com> wrote:
>>
>> Eric Wong <e@80x24.org> writes:
>>
>>
>> [...]
>>
>> > AFAIK, filter-branch is not causing support headaches for any
>> > git developers today.  With so many commands in git, it's
>> > unlikely newbies will ever get around to discover it :)
>> > So I don't think we should be in any rush to remove it.
>>
>> Nah, discovering it is simple. Just Google for "git change author". That
>> eventually leads to a script that uses "git filter-branch --env-filter"
>> to get the job done, and I'm afraid it is spread all over the world.
>>
>> See, e.g.:
>>
>> https://help.github.com/en/articles/changing-author-info
>
> Side note: Is the goal to "fix names and email addresses in this
> repository"?  If so, this guide fails: it doesn't update tagger names
> or email addresses.  Indeed, filter-branch doesn't provide a way to do
> that.  (Not to mention other problems like not updating references to
> commit hashes in commit messages when it is busy rewriting everything.)

No. Maybe the original goal was like that, but I, personally, use a
modified version of this to change my "Author" credentials from
"internal" to "public" in branches that I'm going to send upstream, so
the actual aim is to change the e-mail of a particular Author from a@b
to c@d in all the commits in a (feature) branch.

>
>> > But I agree that filter-branch isn't useful and certainly
>> > shouldn't be encouraged/promoted.
>>
>> Well, is there more suitable way to change author for a (large) set of
>> commits then?
>
> I would say yes, use git filter-repo (note that this thread started
> with me proposing filter-repo for inclusion in git.git -- and getting
> suggestions that we should remove stuff instead of adding more stuff).
> I'm biased, but I think it's much better at this particular job as
> well:

Well, I don't want to change the entire repo, and I don't immediately
see how to do it with git filter-repo. Is it at all possible?

> You can create a mailmap file and pass it to the --mailmap option to
> git-filter-repo.
>
> Or, if you prefer (perhaps you don't like git's mailmap format as used
> by shortlog and now log, or perhaps you really want to be able to do
> regex replacement or something), you can use the --name-callback or
> --email-callback to work on those fields more directly.
>
> Or, if you prefer (e.g. you want to handle author vs. committer vs.
> tagger differently), you can use the --commit-callback and
> --tag-callback filters.
>
> As an added bonus, filter-repo will also perform the rewrite far
> faster than filter-branch (and rewrite commit hashes in commit
> messages as alluded to above).

These things are nice to have indeed, but it always changes the entire
repo, right? If so, it's not a suitable substitute for git-filter-branch
for the particular job at hand.

Actually, I'd rather expect some support for this in "git rebase", it
being git's history editing/reshaping tool, but it looks like it only has
it in a form that is very difficult to automate.

>
>> > Yet there's probably still users which ARE happy with it, that
>> > will never hit the edge cases and problems it poses; and will
>> > never read release notes.  And said users are probably getting
>> > git from a slow-moving distro, so it'd be a disservice to them
>> > if they lost a tool they depend on without any warning.
>>
>> Personally, I'm far from happy with it, but I have no clue how to
>> substitute it in the job above. Anybody?
>
> The start of this thread where I proposed git filter-repo for
> inclusion in git[1] had links to documentation and comparisons to
> other tools and such.  You may find those links helpful; if not, let
> me know what needs to be fixed in the documentation.

Thank you for the references, I find it a very nice tool to have!

Pity it's not an entire substitute for git filter-branch.

-- Sergey
Elijah Newren Aug. 28, 2019, 5:16 p.m. UTC | #7
Hi Sergey,

On Wed, Aug 28, 2019 at 1:52 AM Sergey Organov <sorganov@gmail.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Tue, Aug 27, 2019 at 1:43 AM Sergey Organov <sorganov@gmail.com> wrote:
> >>
> >> Eric Wong <e@80x24.org> writes:
> >>
> >>
> >> [...]
> >>
> >> > AFAIK, filter-branch is not causing support headaches for any
> >> > git developers today.  With so many commands in git, it's
> >> > unlikely newbies will ever get around to discover it :)
> >> > So I don't think we should be in any rush to remove it.
> >>
> >> Nah, discovering it is simple. Just Google for "git change author". That
> >> eventually leads to a script that uses "git filter-branch --env-filter"
> >> to get the job done, and I'm afraid it is spread all over the world.
> >>
> >> See, e.g.:
> >>
> >> https://help.github.com/en/articles/changing-author-info
> >
> > Side note: Is the goal to "fix names and email addresses in this
> > repository"?  If so, this guide fails: it doesn't update tagger names
> > or email addresses.  Indeed, filter-branch doesn't provide a way to do
> > that.  (Not to mention other problems like not updating references to
> > commit hashes in commit messages when it is busy rewriting everything.)
>
> No. Maybe the original goal was like that, but I, personally, use a
> modified version of this to change my "Author" credentials from
> "internal" to "public" in branches that I'm going to send upstream, so
> the actual aim is to change e-mail of particular Author from a@b to c@d
> in all the commits in a (feature) branch.

There's an interesting usecase I hadn't heard of or thought of before.
Quick question to see if I'm understanding correctly: "all commits in
a branch" or "all commits *unique* to a branch"?

(Perhaps the only commits with the author you want to change are among
the commits that are unique to that branch and so the distinction
doesn't matter, but it wasn't clear from the description.)

> >> > But I agree that filter-branch isn't useful and certainly
> >> > shouldn't be encouraged/promoted.
> >>
> >> Well, is there more suitable way to change author for a (large) set of
> >> commits then?
> >
> > I would say yes, use git filter-repo (note that this thread started
> > with me proposing filter-repo for inclusion in git.git -- and getting
> > suggestions that we should remove stuff instead of adding more stuff).
> > I'm biased, but I think it's much better at this particular job as
> > well:
>
> Well, I don't want to change the entire repo, and I don't immediately
> see how to do it with git filter-repo. Is it at all possible?

Yes, it is possible.  filter-repo has a hidden --refs argument
defaulting to --all; you could instead set it to e.g.
origin/master..master.

--refs is the only hidden option in filter-repo.  I know it may look
funny that I spent a bunch of effort to create the
--reference-excluded-parents option to fast-export explicitly so that
it would be possible to do partial history rewrites like this, and
then to hide and avoid documenting this option (though I did hint that
it existed in the documentation if you search for "Partial-repo
filtering"), but there were a few reasons for this:

  * mixing old and new history for most rewrites that
    filter-branch/filter-repo/bfg/etc are used for can really mess things
    up and make it hard to recover from.  I don't like trying to clean up
    repos with accidental duplicate copies of most commits in the repo,
    and I suspect others like it even less.  So, anything that makes it
    easier to make such mistakes needs to have a really good rationale in
    order for me to expose it.
  * The only usecases I knew of for partial repo filtering prior to
    your email were (1) side-stepping insanely slow execution time of poor
    filtering tools like filter-branch, and (2) performing operations
    better suited to git-rebase anyway (e.g. the --signoff option to
    rebase did not exist once upon a time and so folks could have used
    filter-branch to fake it, but using rebase is the better way to make
    this change).  And, even after your email, I'm not sure that has
    changed though, as noted below.

> > You can create a mailmap file and pass it to the --mailmap option to
> > git-filter-repo.
> >
> > Or, if you prefer (perhaps you don't like git's mailmap format as used
> > by shortlog and now log, or perhaps you really want to be able to do
> > regex replacement or something), you can use the --name-callback or
> > --email-callback to work on those fields more directly.
> >
> > Or, if you prefer (e.g. you want to handle author vs. committer vs.
> > tagger differently), you can use the --commit-callback and
> > --tag-callback filters.
> >
> > As an added bonus, filter-repo will also perform the rewrite far
> > faster than filter-branch (and rewrite commit hashes in commit
> > messages as alluded to above).
>
> These things are nice to have indeed, but it always changes the entire
> repo, right? If so, it's not a suitable substitute for git-filter-branch
> for particular job at hand.

It *defaults* to changing the entire repo.  You may also be interested
to note that two of the demos in contrib/filter-repo-demos[1]
explicitly make use of partial history filtering -- one of the two
being the filter-branch reimplementation (I created a script that
reimplemented filter-branch on top of filter-repo and made it accept
the exact same flags as filter-branch.  That script passes all the
filter-branch regression tests from git.git and it's much faster than
filter-branch, though it's still glacially slow compared to
filter-repo, and has all the safety problems that filter-branch does).

[1] https://github.com/newren/git-filter-repo/tree/master/contrib/filter-repo-demos

> Actually, I'd rather expect some support for this in "git rebase", being
> git history editing/reshaping tool, but it looks like it only has it in
> the form that is very difficult to automate.

I agree that git rebase would be the better choice here; I typically
feel it's the better choice for rewrites of recent history.  I think
it provides just what you need:

  git rebase --exec="git commit --amend --reset-author -C HEAD" $UPSTREAM

(Assuming, of course, that you've either set the right environment
variables or set user.name and user.email to the new values you want
so that commit's --reset-author flag can reset to the *new* author.)
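
A sketch of the environment-variable variant, wrapped in a helper so the new identity is scoped to a single rebase. The function name and its arguments are placeholders for this example. Note that --reset-author also renews the author timestamp, and that the committer identity still comes from the usual configuration.

```shell
# Re-author every commit in <upstream>..HEAD as the given identity.
# Usage: reauthor_branch <upstream> <new-name> <new-email>
reauthor_branch () {
	upstream=$1
	# GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL are set only in the environment
	# of this one rebase; the exec'd `git commit` inherits them, and
	# --reset-author picks them up as the new author.
	GIT_AUTHOR_NAME="$2" GIT_AUTHOR_EMAIL="$3" \
	git rebase --exec 'git commit --amend --reset-author -C HEAD' "$upstream"
}
```

For example: `reauthor_branch origin/master "Public Name" public@example.com`.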

> >> > Yet there's probably still users which ARE happy with it, that
> >> > will never hit the edge cases and problems it poses; and will
> >> > never read release notes.  And said users are probably getting
> >> > git from a slow-moving distro, so it'd be a disservice to them
> >> > if they lost a tool they depend on without any warning.
> >>
> >> Personally, I'm far from happy with it, but I have no clue how to
> >> substitute it in the job above. Anybody?
> >
> > The start of this thread where I proposed git filter-repo for
> > inclusion in git[1] had links to documentation and comparisons to
> > other tools and such.  You may find those links helpful; if not, let
> > me know what needs to be fixed in the documentation.
>
> Thank you for the references, I find it a very nice tool to have!
>
> Pity it's not an entire substitute for git filter-branch.

Au contraire, I believe it is.  :-)

Thanks for the interesting usecase.  It sounds like we both think this
one happens to be better solved by rebase, and the command snippet I
provided above should show you how to use rebase to solve it.  However, if
you come up with any others where partial repo filtering makes sense,
I'm always willing to reconsider my decision to make the --refs
argument hidden; it may just mean adding more warnings, but it might
also involve changing other defaults (e.g. the automatic
repacking/pruning).  I'd need concrete usecases to know for sure how
I'd want to handle it.

Hope that helps,
Elijah
Sergey Organov Aug. 28, 2019, 7:03 p.m. UTC | #8
Hi Elijah,

Elijah Newren <newren@gmail.com> writes:

> Hi Sergey,
>
> On Wed, Aug 28, 2019 at 1:52 AM Sergey Organov <sorganov@gmail.com> wrote:
>>
>> Elijah Newren <newren@gmail.com> writes:
>>
>> > On Tue, Aug 27, 2019 at 1:43 AM Sergey Organov <sorganov@gmail.com> wrote:
>> >>
>> >> Eric Wong <e@80x24.org> writes:
>> >>
>> >>
>> >> [...]

[...]

>> >
>> > Side note: Is the goal to "fix names and email addresses in this
>> > repository"?  If so, this guide fails: it doesn't update tagger names
>> > or email addresses.  Indeed, filter-branch doesn't provide a way to do
>> > that.  (Not to mention other problems like not updating references to
>> > commit hashes in commit messages when it is busy rewriting everything.)
>>
>> No. Maybe the original goal was like that, but I, personally, use a
>> modified version of this to change my "Author" credentials from
>> "internal" to "public" in branches that I'm going to send upstream, so
>> the actual aim is to change e-mail of particular Author from a@b to c@d
>> in all the commits in a (feature) branch.
>
> There's an interesting usecase I hadn't heard of or thought of before.
> Quick question to see if I'm understanding correctly: "all commits in
> a branch" or "all commits *unique* to a branch"?
>
> (Perhaps the only commits with the author you want to change are among
> the commits that are unique to that branch and so the distinction
> doesn't matter, but it wasn't clear from the description.)

Yes, this is exactly the case for me, as I'm changing an entirely linear
topic branch that is going to become a patch series to send out. No
complications.

>
>> >> > But I agree that filter-branch isn't useful and certainly
>> >> > shouldn't be encouraged/promoted.
>> >>
>> >> Well, is there more suitable way to change author for a (large) set of
>> >> commits then?
>> >
>> > I would say yes, use git filter-repo (note that this thread started
>> > with me proposing filter-repo for inclusion in git.git -- and getting
>> > suggestions that we should remove stuff instead of adding more stuff).
>> > I'm biased, but I think it's much better at this particular job as
>> > well:
>>
>> Well, I don't want to change the entire repo, and I don't immediately
>> see how to do it with git filter-repo. Is it at all possible?
>
> Yes, it is possible.  filter-repo has a hidden --refs argument
> defaulting to --all; you could instead set it to e.g.
> origin/master..master.

Cool!

>
> --refs is the only hidden option in filter-repo.  I know it may look
> funny that I spent a bunch of effort to create the
> --reference-excluded-parents option to fast-export explicitly so that
> it would be possible to do partial history rewrites like this, and
> then to hide and avoid documenting this option (though I did hint that
> it existed in the documentation if you search for "Partial-repo
> filtering"), but there were a few reasons for this:
>
>   * mixing old and new history for most rewrites that
> filter-branch/filter-repo/bfg/etc are used for can really mess things
> up and make it hard to recover from.  I don't like trying to clean up
> repos with accidental duplicate copies of most commits in the repo,
> and I suspect others like it even less.  So, anything that makes it
> easier to make such mistakes needs to have a really good rationale in
> order for me to expose it.
>   * The only usecases I knew of for partial repo filtering prior to
> your email were (1) side-stepping insanely slow execution time of poor
> filtering tools like filter-branch, and (2) performing operations
> better suited to git-rebase anyway (e.g. the --signoff option to
> rebase did not exist once upon a time and so folks could have used
> filter-branch to fake it, but using rebase is the better way to make
> this change).  And, even after your email, I'm not sure that has
> changed though, as noted below.

Yeah, I share your worries.

[...]

>> Actually, I'd rather expect some support for this in "git rebase", being
>> git history editing/reshaping tool, but it looks like it only has it in
>> the form that is very difficult to automate.
>
> I agree that git rebase would be the better choice here; I typically
> feel it's the better choice for rewrites of recent history.  I think
> it provides just what you need:
>
>   git rebase --exec="git commit --amend --reset-author -C HEAD" $UPSTREAM
>
> (Assuming, of course, that you've either set the right environment
> variables or set user.name and user.email to the new values you want
> so that commit's --reset-author flag can reset to the *new* author.)

This should do the trick for me most of the time, thanks a lot for the clue!

However, the script that I'm using doesn't change _all_ the authors, it
only changes those that match the particular author specified in the
script. I haven't actually needed this feature yet, but I can well
imagine that I will have commits by other author(s) in the branch, and
I won't want to attribute their work to myself.

Hmm... That said, using the generic "--exec" to "git rebase" I could
probably come up with a script that will check the Author of the latest
commit and will choose to either rewrite it or not. Nothing terribly
complex.
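
A minimal sketch of such a check, under a few assumptions: the old
address is passed as the first argument (a@b is just the placeholder
used elsewhere in this thread), user.name and user.email already hold
the new identity, and the function name is made up for illustration:

```shell
# Hypothetical helper, not part of git: amend HEAD only when its author
# email matches the old address, so other authors' commits stay untouched.
fix_author_if_mine() {
    old_email=$1
    # %ae prints the author email of the commit being inspected.
    if [ "$(git log -1 --format=%ae HEAD)" = "$old_email" ]; then
        # Same amend as suggested above; --reset-author picks up the
        # identity that is currently configured (or set in the environment).
        git commit --amend --reset-author -C HEAD
    fi
}
```

Placed in a script that ends with a call to fix_author_if_mine "$1",
this could presumably be driven by something like
  git rebase --exec="sh /abs/path/fix-author.sh a@b" $UPSTREAM
where the script path is hypothetical.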

>
>> >> > Yet there are probably still users who ARE happy with it, who
>> >> > will never hit the edge cases and problems it poses; and will
>> >> > never read release notes.  And said users are probably getting
>> >> > git from a slow-moving distro, so it'd be a disservice to them
>> >> > if they lost a tool they depend on without any warning.
>> >>
>> >> Personally, I'm far from happy with it, but I have no clue how to
>> >> substitute it in the job above. Anybody?
>> >
>> > The start of this thread where I proposed git filter-repo for
>> > inclusion in git[1] had links to documentation and comparisons to
>> > other tools and such.  You may find those links helpful; if not, let
>> > me know what needs to be fixed in the documentation.
>>
>> Thank you for the references, I find it a very nice tool to have!
>>
>> Pity it's not an entire substitute for git filter-branch.
>
> Au contraire, I believe it is.  :-)

I take your word for it :-)

>
> Thanks for the interesting usecase.  It sounds like we both think this
> one happens to be better solved by rebase, and the command snippet I
> provided above should show you how to use rebase to solve it.  However, if
> you come up with any others where partial repo filtering makes sense,
> I'm always willing to reconsider my decision to make the --refs
> argument hidden; it may just mean adding more warnings, but it might
> also involve changing other defaults (e.g. the automatic
> repacking/pruning).  I'd need concrete usecases to know for sure how
> I'd want to handle it.

OK, thanks a lot! Doesn't seem to be necessary for now due to the rebase
trick you've suggested.

>
> Hope that helps,

Sure it does!

-- Sergey
Johannes Schindelin Aug. 30, 2019, 8:40 p.m. UTC | #9
Hi Elijah,


On Wed, 28 Aug 2019, Elijah Newren wrote:

> Hi Sergey,
>
> On Wed, Aug 28, 2019 at 1:52 AM Sergey Organov <sorganov@gmail.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > On Tue, Aug 27, 2019 at 1:43 AM Sergey Organov <sorganov@gmail.com> wrote:
> > >>
> > >> Eric Wong <e@80x24.org> writes:
> > >>
> > >>
> > >> [...]
> > >>
> > >> > AFAIK, filter-branch is not causing support headaches for any
> > >> > git developers today.  With so many commands in git, it's
> > >> > unlikely newbies will ever get around to discover it :)
> > >> > So I don't think we should be in any rush to remove it.
> > >>
> > >> Nah, discovering it is simple. Just Google for "git change author". That
> > >> eventually leads to a script that uses "git filter-branch --env-filter"
> > >> to get the job done, and I'm afraid it is spread all over the world.
> > >>
> > >> See, e.g.:
> > >>
> > >> https://help.github.com/en/articles/changing-author-info
> > >
> > > Side note: Is the goal to "fix names and email addresses in this
> > > repository"?  If so, this guide fails: it doesn't update tagger names
> > > or email addresses.  Indeed, filter-branch doesn't provide a way to do
> > > that.  (Not to mention other problems like not updating references to
> > > commit hashes in commit messages when it is busy rewriting everything.)
> >
> > No. Maybe the original goal was like that, but I, personally, use a
> > modified version of this to change my "Author" credentials from
> > "internal" to "public" in branches that I'm going to send upstream, so
> > the actual aim is to change e-mail of particular Author from a@b to c@d
> > in all the commits in a (feature) branch.
>
> There's an interesting usecase I hadn't heard of or thought of before.

I'll throw in another use case that's kinda related: extracting the
history of one file (or subdirectory).

In my most recent instance of this, I wanted to publish the script I
used to use for submitting patch series to the Git mailing list,
maintaining tags for iterations and generating cover letters from branch
descriptions and interdiffs (this script eventually became GitGitGadget,
https://github.com/gitgitgadget/gitgitgadget/commits?after=6fb0ede48f86e729292ee1542729bc0f5a30cfa6+0
demonstrates this).

To do that, I ran a `git filter-branch` in the repository where I track
all the scripts I deem unsuitable for public consumption, to remove all
files but `mail-patch-series.sh`, then pushed it to
https://github.com/dscho/mail-patch-series

Please note that most crucially, I wanted to rewrite a newly-created
branch, and only that branch.

Could I have done the same using `git fast-export`, filtering the output
with a Perl script, then passing it to `git fast-import`? Sure, I was
really tempted to do that. In the end, it took less of _my_ time to just
let `git filter-branch` do its work with a not-too-complicated index
filter.

In another instance, a long, long time ago, I needed to restart a
repository which had included way too many files for its own good, then
rename the old repository and start with a fresh `master` that contained
but a single commit whose tree was identical to the previous `master`'s
tip commit. I simply grafted that commit, ran `git filter-branch` and
had precisely what I needed.

I would be _delighted_ if these kinds of use case (rewriting a branch,
or even just a commit range) became more of a first-class citizen with
`git filter-repo`.

Thanks,
Dscho
Elijah Newren Aug. 30, 2019, 11:22 p.m. UTC | #10
Hi Dscho,

On Fri, Aug 30, 2019 at 1:40 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Elijah,
>
>
> On Wed, 28 Aug 2019, Elijah Newren wrote:
>
> > Hi Sergey,
> >
> > On Wed, Aug 28, 2019 at 1:52 AM Sergey Organov <sorganov@gmail.com> wrote:
> > >
> > > Elijah Newren <newren@gmail.com> writes:
> > >
> > > > On Tue, Aug 27, 2019 at 1:43 AM Sergey Organov <sorganov@gmail.com> wrote:
> > > >>
> > > >> Eric Wong <e@80x24.org> writes:
> > > >>
> > > >>
> > > >> [...]
> > > >>
> > > >> > AFAIK, filter-branch is not causing support headaches for any
> > > >> > git developers today.  With so many commands in git, it's
> > > >> > unlikely newbies will ever get around to discover it :)
> > > >> > So I don't think we should be in any rush to remove it.
> > > >>
> > > >> Nah, discovering it is simple. Just Google for "git change author". That
> > > >> eventually leads to a script that uses "git filter-branch --env-filter"
> > > >> to get the job done, and I'm afraid it is spread all over the world.
> > > >>
> > > >> See, e.g.:
> > > >>
> > > >> https://help.github.com/en/articles/changing-author-info
> > > >
> > > > Side note: Is the goal to "fix names and email addresses in this
> > > > repository"?  If so, this guide fails: it doesn't update tagger names
> > > > or email addresses.  Indeed, filter-branch doesn't provide a way to do
> > > > that.  (Not to mention other problems like not updating references to
> > > > commit hashes in commit messages when it is busy rewriting everything.)
> > >
> > > No. Maybe the original goal was like that, but I, personally, use a
> > > modified version of this to change my "Author" credentials from
> > > "internal" to "public" in branches that I'm going to send upstream, so
> > > the actual aim is to change e-mail of particular Author from a@b to c@d
> > > in all the commits in a (feature) branch.
> >
> > There's an interesting usecase I hadn't heard of or thought of before.
>
> I'll throw in another use case that's kinda related: extracting the
> history of one file (or subdirectory).

Thanks for sending these along!  I do have some comments, and a bunch
of questions...

> In my most recent instance of this, I wanted to publish the script I
> used to use for submitting patch series to the Git mailing list,
> maintaining tags for iterations and generating cover letters from branch
> descriptions and interdiffs (this script eventually became GitGitGadget,
> https://github.com/gitgitgadget/gitgitgadget/commits?after=6fb0ede48f86e729292ee1542729bc0f5a30cfa6+0
> demonstrates this).
>
> To do that, I ran a `git filter-branch` in the repository where I track
> all the scripts I deem unsuitable for public consumption, to remove all
> files but `mail-patch-series.sh`, then pushed it to
> https://github.com/dscho/mail-patch-series
>
> Please note that most crucially, I wanted to rewrite a newly-created
> branch, and only that branch.
>
> Could I have done the same using `git fast-export`, filtering the output
> with a Perl script, then passing it to `git fast-import`? Sure, I was
> really tempted to do that. In the end, it took less of _my_ time to just
> let `git filter-branch` do its work with a not-too-complicated index
> filter.

Why a perl script?  Shouldn't
    git fast-export [--no-data] HEAD -- $PATH | git fast-import --force --quiet
do the trick?  And it's probably simpler and shorter than the index
filter you used.

That said, yeah it'd be nice to get automatic rewriting of commit
hashes in commit messages and other niceties from filter-repo (e.g.
future automatic reattaching of notes to the rewritten commits).  Some
questions:

  * What's the backup strategy in case you specify the wrong filters
(e.g. you have a typo in the pathnames)?  filter-repo encourages folks
to make a clone and then filter the fresh clone, because if anything
goes awry, you can just delete and restart.  (I am heavily opposed to
the refs/original/ backup mechanism used by filter-branch, for
multiple reasons.)  Is your safety stance just "If I mess up it's my
own fault; do the rewrite?"  Or are you okay with cloning before
filtering?
  * If you're okay with cloning before filtering...then is there an
issue with rewriting all branches, and just pushing the one you need?
(Is there an issue with "this branch is small, the others are huge,
and filter-branch is slow -- so rewriting one branch saves me lots of
time"?  Or are there other issues at play too?)
  * What if the user has auxiliary information for the branch in other
refs?  For example, git-notes pointing at any of the commits, or tags
in the history of the branch that might be relevant, or perhaps even
replace refs in combination with GIT_NO_REPLACE_OBJECTS=1?  Is this an
"I don't care, toss that stuff and just rewrite just this branch?"
  * filter-repo by default creates new replace references so that you
can refer to new commit IDs using old (unabbreviated) commit IDs.
Would that be considered helpful for this usecase?  unhelpful?
irrelevant, since you'll just push the branch you want somewhere and
nuke the temporary clone?


I'm not by any means ruling out the possibility of documenting --refs
and adjusting the defaults when it is used so the user can just run
something like
   git filter-repo --path $PATH --refs $MYBRANCH
but I feel like I need to understand answers to questions like the
above ones so that I can know how to phrase warnings and adjust
defaults and update the documentation.

> In another instance, a long, long time ago, I needed to restart a
> repository which had included way too many files for its own good, then
> rename the old repository and start with a fresh `master` that contained
> but a single commit whose tree was identical to the previous `master`'s
> tip commit. I simply grafted that commit, ran `git filter-branch` and
> had precisely what I needed.

filter-repo supports grafts and replace objects, the same as
filter-branch.  (Although, technically, I didn't have to do a thing to
support it; fast-export does the special handling of rewriting based
on grafts and replace objects.)  So, I'd say this is fully supported.

Side question: the git-replace documents suggest that the graft file
is deprecated.  Are there any timeframes or plans for phasing out
beyond the git-replace manpage existing?  Should I avoid documenting
the graft file support in filter-repo?  Should I include examples
using not just git-replace but also using the graft file?

> I would be _delighted_ if these kinds of use case (rewriting a branch,
> or even just a commit range) became more of a first-class citizen with
> `git filter-repo`.

I've got all the pieces for supporting a single branch or a commit
range (e.g. 'git filter-repo --path foo --refs ^master~4 ^stable~23
mybranch'), but the defaults (error out unless in a bare repo, move
refs/remotes/origin/* to refs/heads/*, disconnect origin remote,
expire reflogs & repack & prune, create new replace references so
folks can access new commits using old commit IDs) may be somewhat
friction-filled for this usecase.  Those defaults other than the new
replace refs happen to all be turned off with the combination of
--force and --target, so, assuming turning them off is what you need,
you could cheat and just specify 'git filter-repo --force --target .
--refs $MYBRANCH' today and perhaps get what you want, but that's a
really non-intuitive command line that is way too ugly to recommend.
And I don't want to tie myself to '--target .' being the magic sauce
in the future either.
Johannes Schindelin Sept. 2, 2019, 9:29 a.m. UTC | #11
Hi Elijah,

On Fri, 30 Aug 2019, Elijah Newren wrote:

> On Fri, Aug 30, 2019 at 1:40 PM Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
>
> > [...]
> > In my most recent instance of this, I wanted to publish the script I
> > used to use for submitting patch series to the Git mailing list,
> > maintaining tags for iterations and generating cover letters from branch
> > descriptions and interdiffs (this script eventually became GitGitGadget,
> > https://github.com/gitgitgadget/gitgitgadget/commits?after=6fb0ede48f86e729292ee1542729bc0f5a30cfa6+0
> > demonstrates this).
> >
> > To do that, I ran a `git filter-branch` in the repository where I track
> > all the scripts I deem unsuitable for public consumption, to remove all
> > files but `mail-patch-series.sh`, then pushed it to
> > https://github.com/dscho/mail-patch-series
> >
> > Please note that most crucially, I wanted to rewrite a newly-created
> > branch, and only that branch.
> >
> > Could I have done the same using `git fast-export`, filtering the output
> > with a Perl script, then passing it to `git fast-import`? Sure, I was
> > really tempted to do that. In the end, it took less of _my_ time to just
> > let `git filter-branch` do its work with a not-too-complicated index
> > filter.
>
> Why a perl script?  Shouldn't
>     git fast-export [--no-data] HEAD -- $PATH | git fast-import --force --quiet
> do the trick?  And it's probably simpler and shorter than the index
> filter you used.

Does that not keep the full `$PATH`? I wanted the resulting branch to
have the file in the top-level directory.

> That said, yeah it'd be nice to get automatic rewriting of commit
> hashes in commit messages and other niceties from filter-repo (e.g.
> future automatic reattaching of notes to the rewritten commits).  Some
> questions:
>
>   * What's the backup strategy in case you specify the wrong filters
> (e.g. you have a typo in the pathnames)?  filter-repo encourages folks
> to make a clone and then filter the fresh clone, because if anything
> goes awry, you can just delete and restart.  (I am heavily opposed to
> the refs/original/ backup mechanism used by filter-branch, for
> multiple reasons.)  Is your safety stance just "If I mess up it's my
> own fault; do the rewrite?"  Or are you okay with cloning before
> filtering?

Please note that the `refs/original/` refs should not have been written
at all anymore, not after reflogs were introduced.

Incidentally, that is my answer to your question: the reflog is my
backup.

>   * If you're okay with cloning before filtering...then is there an
> issue with rewriting all branches, and just pushing the one you need?
> (Is there an issue with "this branch is small, the others are huge,
> and filter-branch is slow -- so rewriting one branch saves me lots of
> time"?  Or are there other issues at play too?)

I am not okay with cloning before filtering.

First of all, it is wasteful.

Second of all, in my case it would have been *particularly* wasteful
because the repository in question also has quite a few quite large
blobs (hysterical raisins, don't ask).

>   * What if the user has auxiliary information for the branch in other
> refs?  For example, git-notes pointing at any of the commits, or tags
> in the history of the branch that might be relevant, or perhaps even
> replace refs in combination with GIT_NO_REPLACE_OBJECTS=1?  Is this an
> "I don't care, toss that stuff and just rewrite just this branch?"

In my case: there are no notes. The only time when I make heavy use of
notes is in GitGitGadget. I don't use that feature otherwise.

>   * filter-repo by default creates new replace references so that you
> can refer to new commit IDs using old (unabbreviated) commit IDs.
> Would that be considered helpful for this usecase?  unhelpful?
> irrelevant, since you'll just push the branch you want somewhere and
> nuke the temporary clone?

I definitely did not need that mapping in all of my `git filter-branch`
use cases.

Of course, I can see how it can come in handy in other circumstances,
just not in the ones I experienced so far.

> I'm not by any means ruling out the possibility of documenting --refs
> and adjusting the defaults when it is used so the user can just run
> something like
>    git filter-repo --path $PATH --refs $MYBRANCH
> but I feel like I need to understand answers to questions like the
> above ones so that I can know how to phrase warnings and adjust
> defaults and update the documentation.

In all the scenarios where I used `git filter-branch` (some dozen per
year, so not all *that* many), I needed to rewrite one particular
branch, typically a freshly-created one. I never, ever ever needed to
rewrite all the refs in the repository. Not once ;-)

> > In another instance, a long, long time ago, I needed to restart a
> > repository which had included way too many files for its own good, then
> > rename the old repository and start with a fresh `master` that contained
> > but a single commit whose tree was identical to the previous `master`'s
> > tip commit. I simply grafted that commit, ran `git filter-branch` and
> > had precisely what I needed.
>
> filter-repo supports grafts and replace objects, the same as
> filter-branch.  (Although, technically, I didn't have to do a thing to
> support it; fast-export does the special handling of rewriting based
> on grafts and replace objects.)  So, I'd say this is fully supported.
>
> Side question: the git-replace documents suggest that the graft file
> is deprecated.  Are there any timeframes or plans for phasing out
> beyond the git-replace manpage existing?  Should I avoid documenting
> the graft file support in filter-repo?  Should I include examples
> using not just git-replace but also using the graft file?

I had meant to prepare a patch series to remove `grafts` support that
Junio could carry in `pu` until the time he considers it appropriate to
merge to `master`, but it seems that this task fell under the rag.

The deprecation itself has been introduced in tags/v2.18.0-rc0~54^2~4,
i.e. it is official as of Git v2.18.0, which was released in mid-June
last year.

My personal gut feeling is that we should let it simmer for another year
before removing support for the `grafts` file (and we may want to update
the label "grafted" when `git log` shows a shallow commit before we
remove that support for `grafts`).

So I'll not work on that patch for now.

> > I would be _delighted_ if these kinds of use case (rewriting a branch,
> > or even just a commit range) became more of a first-class citizen with
> > `git filter-repo`.
>
> I've got all the pieces for supporting a single branch or a commit
> range (e.g. 'git filter-repo --path foo --refs ^master~4 ^stable~23
> mybranch'), but the defaults (error out unless in a bare repo, move
> refs/remotes/origin/* to refs/heads/*, disconnect origin remote,
> expire reflogs & repack & prune, create new replace references so
> folks can access new commits using old commit IDs) may be somewhat
> friction-filled for this usecase.  Those defaults other than the new
> replace refs happen to all be turned off with the combination of
> --force and --target, so, assuming turning them off is what you need,
> you could cheat and just specify 'git filter-repo --force --target .
> --refs $MYBRANCH' today and perhaps get what you want, but that's a
> really non-intuitive command line that is way too ugly to recommend.
> And I don't want to tie myself to '--target .' being the magic sauce
> in the future either.

I agree. I would love for my use cases to become more of first-class
citizens. Maybe `--branch <branch>` could serve as the knob?

What I also found really helpful in `git filter-branch` is that it was
possible to pass one-liner shell scripts directly to the command, giving
a lot of freedom about the transformations. I understand that Python
makes it hard to write spaghetti-code one-liners, so you cannot really
pass the snippet in via the command-line, but I hope there is a way to
script things in `git filter-repo`?

Ciao,
Dscho
Elijah Newren Sept. 3, 2019, 5:37 p.m. UTC | #12
Hi Dscho,

On Mon, Sep 2, 2019 at 2:30 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Elijah,
>
> On Fri, 30 Aug 2019, Elijah Newren wrote:
>
> > On Fri, Aug 30, 2019 at 1:40 PM Johannes Schindelin
> > <Johannes.Schindelin@gmx.de> wrote:
> >
> > > [...]
> > > In my most recent instance of this, I wanted to publish the script I
> > > used to use for submitting patch series to the Git mailing list,
> > > maintaining tags for iterations and generating cover letters from branch
> > > descriptions and interdiffs (this script eventually became GitGitGadget,
> > > https://github.com/gitgitgadget/gitgitgadget/commits?after=6fb0ede48f86e729292ee1542729bc0f5a30cfa6+0
> > > demonstrates this).
> > >
> > > To do that, I ran a `git filter-branch` in the repository where I track
> > > all the scripts I deem unsuitable for public consumption, to remove all
> > > files but `mail-patch-series.sh`, then pushed it to
> > > https://github.com/dscho/mail-patch-series
> > >
> > > Please note that most crucially, I wanted to rewrite a newly-created
> > > branch, and only that branch.
> > >
> > > Could I have done the same using `git fast-export`, filtering the output
> > > with a Perl script, then passing it to `git fast-import`? Sure, I was
> > > really tempted to do that. In the end, it took less of _my_ time to just
> > > let `git filter-branch` do its work with a not-too-complicated index
> > > filter.
> >
> > Why a perl script?  Shouldn't
> >     git fast-export [--no-data] HEAD -- $PATH | git fast-import --force --quiet
> > do the trick?  And it's probably simpler and shorter than the index
> > filter you used.
>
> Does that not keep the full `$PATH`? I wanted the resulting branch to
> have the file in the top-level directory.

Ah, gotcha; I read your original description to suggest that the
script was already at the toplevel.

> > That said, yeah it'd be nice to get automatic rewriting of commit
> > hashes in commit messages and other niceties from filter-repo (e.g.
> > future automatic reattaching of notes to the rewritten commits).  Some
> > questions:
> >
> >   * What's the backup strategy in case you specify the wrong filters
> > (e.g. you have a typo in the pathnames)?  filter-repo encourages folks
> > to make a clone and then filter the fresh clone, because if anything
> > goes awry, you can just delete and restart.  (I am heavily opposed to
> > the refs/original/ backup mechanism used by filter-branch, for
> > multiple reasons.)  Is your safety stance just "If I mess up it's my
> > own fault; do the rewrite?"  Or are you okay with cloning before
> > filtering?
>
> Please note that the `refs/original/` refs should not have been written
> at all anymore, not after reflogs were introduced.
>
> Incidentally, that is my answer to your question: the reflog is my
> backup.

The reflog is great, but while it works in your special case please note that:

1. Anyone filtering more than one ref may have some difficulty
restoring correctly (they have to look in several reflogs, and can't
script restoring from all older reflog versions).
2. Few folks have core.logAllRefUpdates set to 'always', meaning
they'll lack a backup for some refs if the reflog is relied upon.
3. If the filter specifies only keeping a list of files that happen to
not exist within one of the branches (perhaps a filename was typo'ed)
and if pruning empty commits, then the branch can be deleted, and git
doesn't have a mechanism for deleting a branch without deleting its
reflog as far as I know.

Point 1 is kind of minor, but points 2 and 3 are showstoppers with
regard to my recommending the reflogs as a reliable recovery
mechanism after general filtering operations, and this is true for
either filter-branch or filter-repo.  (That said, I can definitely
allow people to choose their risks and just provide some
here-be-dragons warnings.)
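
To make point 2 concrete, here is a rough check one could run before
trusting the reflog as a backup.  It is only a sketch: the function
name is invented for illustration, and it looks at nothing but the
core.logAllRefUpdates setting mentioned above.

```shell
# Hypothetical helper: succeed only if core.logAllRefUpdates is 'always',
# i.e. only if every ref under refs/ is expected to get a reflog.
reflog_covers_all_refs() {
    # An unset value prints nothing; case is normalised defensively.
    value=$(git config --get core.logAllRefUpdates | tr 'A-Z' 'a-z')
    [ "$value" = "always" ]
}
```

A wrapper script might call it before starting a rewrite, e.g.
  reflog_covers_all_refs || echo "warning: some refs may have no reflog" >&2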

> >   * If you're okay with cloning before filtering...then is there an
> > issue with rewriting all branches, and just pushing the one you need?
> > (Is there an issue with "this branch is small, the others are huge,
> > and filter-branch is slow -- so rewriting one branch saves me lots of
> > time"?  Or are there other issues at play too?)
>
> I am not okay with cloning before filtering.
>
> First of all, it is wasteful.
>
> Second of all, in my case it would have been *particularly* wasteful
> because the repository in question also has quite a few quite large
> blobs (hysterical raisins, don't ask).
>
> >   * What if the user has auxiliary information for the branch in other
> > refs?  For example, git-notes pointing at any of the commits, or tags
> > in the history of the branch that might be relevant, or perhaps even
> > replace refs in combination with GIT_NO_REPLACE_OBJECTS=1?  Is this an
> > "I don't care, toss that stuff and just rewrite just this branch?"
>
> In my case: there are no notes. The only time when I make heavy use of
> notes is in GitGitGadget. I don't use that feature otherwise.
>
> >   * filter-repo by default creates new replace references so that you
> > can refer to new commit IDs using old (unabbreviated) commit IDs.
> > Would that be considered helpful for this usecase?  unhelpful?
> > irrelevant, since you'll just push the branch you want somewhere and
> > nuke the temporary clone?
>
> I definitely did not need that mapping in all of my `git filter-branch`
> use cases.
>
> Of course, I can see how it can come in handy in other circumstances,
> just not in the ones I experienced so far.
>
> > I'm not by any means ruling out the possibility of documenting --refs
> > and adjusting the defaults when it is used so the user can just run
> > something like
> >    git filter-repo --path $PATH --refs $MYBRANCH
> > but I feel like I need to understand answers to questions like the
> > above ones so that I can know how to phrase warnings and adjust
> > defaults and update the documentation.
>
> In all the scenarios where I used `git filter-branch` (some dozen per
> year, so not all *that* many), I needed to rewrite one particular
> branch, typically a freshly-created one. I never, ever ever needed to
> rewrite all the refs in the repository. Not once ;-)

Thanks for answering all these and providing the extra context.  Very helpful.

> > > In another instance, a long, long time ago, I needed to restart a
> > > repository which had included way too many files for its own good, then
> > > rename the old repository and start with a fresh `master` that contained
> > > but a single commit whose tree was identical to the previous `master`'s
> > > tip commit. I simply grafted that commit, ran `git filter-branch` and
> > > had precisely what I needed.
> >
> > filter-repo supports grafts and replace objects, the same as
> > filter-branch.  (Although, technically, I didn't have to do a thing to
> > support it; fast-export does the special handling of rewriting based
> > on grafts and replace objects.)  So, I'd say this is fully supported.
> >
> > Side question: the git-replace documents suggest that the graft file
> > is deprecated.  Are there any timeframes or plans for phasing out
> > beyond the git-replace manpage existing?  Should I avoid documenting
> > the graft file support in filter-repo?  Should I include examples
> > using not just git-replace but also using the graft file?
>
> I had meant to prepare a patch series to remove `grafts` support that
> Junio could carry in `pu` until the time he considers it appropriate to
> merge to `master`, but it seems that this task fell under the rag.
>
> The deprecation itself has been introduced in tags/v2.18.0-rc0~54^2~4,
> i.e. it is official as of Git v2.18.0, which was released in mid-June
> last year.
>
> My personal gut feeling is that we should let it simmer for another year
> before removing support for the `grafts` file (and we may want to update
> the label "grafted" when `git log` shows a shallow commit before we
> remove that support for `grafts`).
>
> So I'll not work on that patch for now.

Thanks for the extra history.

> > > I would be _delighted_ if these kinds of use case (rewriting a branch,
> > > or even just a commit range) became more of a first-class citizen with
> > > `git filter-repo`.
> >
> > I've got all the pieces for supporting a single branch or a commit
> > range (e.g. 'git filter-repo --path foo --refs ^master~4 ^stable~23
> > mybranch'), but the defaults (error out unless in a bare repo, move
> > refs/remotes/origin/* to refs/heads/*, disconnect origin remote,
> > expire reflogs & repack & prune, create new replace references so
> > folks can access new commits using old commit IDs) may be somewhat
> > friction-filled for this usecase.  Those defaults other than the new
> > replace refs happen to all be turned off with the combination of
> > --force and --target, so, assuming turning them off is what you need,
> > you could cheat and just specify 'git filter-repo --force --target .
> > --refs $MYBRANCH' today and perhaps get what you want, but that's a
> > really non-intuitive command line that is way too ugly to recommend.
> > And I don't want to tie myself to '--target .' being the magic sauce
> > in the future either.
>
> I agree. I would love for my use cases to become more of first-class
> citizens. Maybe `--branch <branch>` could serve as the knob?

I think I can put something together to make your usecases better.  I
dislike --branch, though, because:
  * It suggests it doesn't work for tags or other things outside of refs/heads/*
  * It suggests that revision ranges are unwelcome.
So, I'd prefer a more generic --refs which can potentially take
multiple arguments, e.g. any of
  --refs mybranch
  --refs mytag
  --refs HEAD~5..mybranch
  --refs ^origin/master ^origin/other-feature mybranch1 mybranch2

> What I also found really helpful in `git filter-branch` is that it was
> possible to pass one-liner shell scripts directly to the command, giving
> a lot of freedom about the transformations. I understand that Python
> makes it hard to write spaghetti-code one-liners, so you cannot really
> pass the snippet in via the command-line, but I hope there is a way to
> script things in `git filter-repo`?

I agree, having the ability to use a programming language snippet or
even full script for special cases is really nice. You can totally do
that with filter-repo, at three different levels:

1. Light-control for easier cases: You can use the command line
arguments --filename-callback, --message-callback, --name-callback,
--email-callback, or --refname-callback and provide python snippets
(usually one-liners) that return a new value.  Most of these are for
editing fields shared across multiple object types (e.g. the name
callback will be used to edit author && committer && tagger names).
However, the filename callback allows both editing the filename and
also filtering based upon filename (you can return the original
filename, OR a new name, OR you can return None to state that you want
files with that name filtered out of commits).

2. Moderate-control: You can use the command line arguments
--blob-callback, --commit-callback, --tag-callback, or
--reset-callback and provide python snippets (possibly one-liners but
more likely to be complex enough that you want newlines) that modify
these fast-import-stream objects.  These provide more control but tend
to be slightly more work. (For example, if you want to rename
branches, you need to worry about three callbacks: commit && tag &&
reset; by comparison, you'd only need to use the refname callback from
lighter control.  Also, if you're worried about filtering based on
filenames, you'll need to dig the filenames (and modes and change
types) out of the list commit.file_changes.)

3. All-in: You can write a python script that imports filter-repo as a
python module and (among other things) set up your own
functions/classes as callbacks.  You have to do a bit more setup to
specify the options you are running with, list how many export and
import processes you want to run with, name all your callbacks, etc.,
but it allows you to do anything from just providing a slightly more
involved callback up to and including creating your own filtering tool
with a totally different user interface while still leveraging
filter-repo's capability.  There are multiple examples provided along
that range too (including bfg-ish and
filter-lamely/filter-branch-ish.)
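
To make the difference between levels 1 and 2 a bit more concrete,
here is a rough sketch.  The FileChange and Commit classes below are
hypothetical stand-ins so the snippet runs without filter-repo
installed; they only mimic the attributes mentioned above
(commit.file_changes, filenames, modes, change types) and are not
filter-repo's real classes.  The sketch also assumes filenames and
refnames are bytestrings, and the rename/drop rules are made up for
illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Level 1: roughly what a --filename-callback body does, written out as a
# function.  Return the original name, a new name, or None to filter out.
def filename_callback(filename: bytes) -> Optional[bytes]:
    if filename.endswith(b".o"):
        return None                               # drop from all commits
    if filename.startswith(b"src/"):
        return b"lib/" + filename[len(b"src/"):]  # rename src/ -> lib/
    return filename                               # keep unchanged


# Hypothetical stand-ins for the fast-import-stream objects (assumption:
# simplified, not the real filter-repo classes).
@dataclass
class FileChange:
    type: bytes                   # change type, e.g. b"M" or b"D"
    filename: bytes
    mode: Optional[bytes] = None  # e.g. b"100644"; unset for deletions


@dataclass
class Commit:
    branch: bytes
    file_changes: List[FileChange] = field(default_factory=list)


# Level 2: a commit callback modifies the commit object in place, so the
# filenames have to be dug out of commit.file_changes by hand.
def commit_callback(commit: Commit) -> None:
    kept = []
    for change in commit.file_changes:
        new_name = filename_callback(change.filename)
        if new_name is None:
            continue              # filter this file out of the commit
        change.filename = new_name
        kept.append(change)
    commit.file_changes = kept
    # Renaming a branch here covers only commits; tags and resets would
    # need the same treatment in their own callbacks.
    if commit.branch == b"refs/heads/old-name":
        commit.branch = b"refs/heads/new-name"
```

With the lighter-control options only the body of filename_callback
would be passed as the snippet; the commit-level version is more work
but has access to everything else on the commit.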

For more details about all of these, see:
https://github.com/newren/git-filter-repo/blob/a6a6a1b0/README.md#callbacks
https://github.com/newren/git-filter-repo/blob/a6a6a1b0/README.md#using-filter-repo-as-a-library