[v2,3/4] Recommend git-filter-repo instead of git-filter-branch
diff mbox series

Message ID 20190828002210.8862-4-newren@gmail.com
State New
Headers show
Series
  • Warn about git-filter-branch usage and avoid it
Related show

Commit Message

Elijah Newren Aug. 28, 2019, 12:22 a.m. UTC
filter-branch suffers from a deluge of disguised dangers that disfigure
history rewrites (i.e. deviate from the deliberate changes).  Many of
these problems are unobtrusive and can easily go undiscovered until the
new repository is in use.  This can result in problems ranging from an
even messier history than what led folks to filter-branch in the first
place, to data loss or corruption.  These issues cannot be backward
compatibly fixed, so add a warning to both filter-branch and its manpage
recommending that another tool (such as filter-repo) be used instead.

Also, update other manpages that referenced filter-branch.  Several of
these needed updates even if we could continue recommending
filter-branch, either due to implying that something was unique to
filter-branch when it applied more generally to all history rewriting
tools (e.g. BFG, reposurgeon, fast-import, filter-repo), or because
something about filter-branch was used as an example despite other more
commonly known examples now existing.  Reword these sections to fix
these issues and to avoid recommending filter-branch.

Finally, remove the section explaining BFG Repo Cleaner as an
alternative to filter-branch.  I feel somewhat bad about this,
especially since I feel like I learned so much from BFG that I put to
good use in filter-repo (which is much more than I can say for
filter-branch), but keeping that section presented a few problems:
  * In order to recommend that people quit using filter-branch, we need
    to provide them a recomendation for something else to use that
    can handle all the same types of rewrites.  To my knowledge,
    filter-repo is the only such tool.  So it needs to be mentioned.
  * I don't want to give conflicting recommendations to users
  * If we recommend two tools, we shouldn't expect users to learn both
    and pick which one to use; we should explain which problems one
    can solve that the other can't or when one is much faster than
    the other.
  * BFG and filter-repo have similar performance
  * All filtering types that BFG can do, filter-repo can also do.  In
    fact, filter-repo comes with a reimplementation of BFG named
    bfg-ish which provides the same user-interface as BFG but with
    several bugfixes and new features that are hard to implement in
    BFG due to its technical underpinnings.
While I could still mention both tools, it seems like I would need to
provide some kind of comparison and I would ultimately just say that
filter-repo can do everything BFG can, so ultimately it seems that it
is just better to remove that section altogether.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt   |  6 ++--
 Documentation/git-filter-branch.txt | 45 +++++++++--------------------
 Documentation/git-gc.txt            | 17 +++++------
 Documentation/git-rebase.txt        |  2 +-
 Documentation/git-replace.txt       | 10 +++----
 Documentation/git-svn.txt           |  4 +--
 Documentation/githooks.txt          |  7 +++--
 contrib/svn-fe/svn-fe.txt           |  4 +--
 git-filter-branch.sh                | 13 +++++++++
 9 files changed, 52 insertions(+), 56 deletions(-)
 mode change 100755 => 100644 git-filter-branch.sh

Comments

Eric Sunshine Aug. 28, 2019, 6:17 a.m. UTC | #1
On Tue, Aug 27, 2019 at 8:22 PM Elijah Newren <newren@gmail.com> wrote:
> filter-branch suffers from a deluge of disguised dangers that disfigure
> history rewrites (i.e. deviate from the deliberate changes). [...]
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
> @@ -16,6 +16,20 @@ SYNOPSIS
> +WARNING
> +-------
> +'git filter-branch' has a plethora of pitfalls that can produce non-obvious
> +manglings of the intended history rewrite (and can leave you with little
> +time to investigate such problems since it has such abysmal performance).
> +These safety and performance issues cannot be backward compatibly fixed and
> +as such, its use is not recommended.  Please use an alternative history
> +filtering tool such as https://github.com/newren/git-filter-repo/[git
> +filter-repo].  If you still need to use 'git filter-branch', please
> +carefully read the "Safety" section of the message on the Git mailing list
> +https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
> +the land mines of filter-branch] and vigilantly avoid as many of the
> +hazards listed there as reasonably possible.

Is there a good reason to not simply copy the "Safety" section from
that email directly into this document so that readers don't have to
go chasing down the link (especially those who are reading
documentation offline)?

> diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
> @@ -832,7 +832,7 @@ Hard case: The changes are not the same.::
>         This happens if the 'subsystem' rebase had conflicts, or used
>         `--interactive` to omit, edit, squash, or fixup commits; or
>         if the upstream used one of `commit --amend`, `reset`, or
> -       `filter-branch`.
> +       a full history rewriting command like `filter-repo`.

Do we want a clickable link to `filter-repo` here?

> diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
> @@ -123,10 +123,10 @@ The following format are available:
> +linkgit:git-hash-object[1], linkgit:git-rebase[1], and
> +linkgit:git-filter-repo[1], among other git commands, can be used to
> [...]
> @@ -148,8 +148,8 @@ pending objects.
>  linkgit:git-hash-object[1]
>  linkgit:git-rebase[1]
> +linkgit:git-filter-repo[1]

Are these 'linkgit:' references to `filter-repo` going to be
meaningful if that tool is not incorporated into the Git project
proper? Perhaps use a generic clickable link instead.

Same comment applies to other 'linkgit:' invocations in the remainder
of the patch.

> diff --git a/git-filter-branch.sh b/git-filter-branch.sh
> old mode 100755
> new mode 100644

Why lose the executable bit?

> @@ -83,6 +83,19 @@ set_ident () {
> +if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
> +     -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then

If this script didn't already have a mix of styles, I'd say something
about modern style being:

    if test -z "$FILTER_BRANCH_SQUELCH_WARNING" &&
        test -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
    then
        ...
    fi

> +       cat <<EOF
> +WARNING: git-filter-branch has a glut of gotchas generating mangled history
> +         rewrites.  Please use an alternative filtering tool such as 'git
> +         filter-repo' (https://github.com/newren/git-filter-repo/) instead.
> +         See the filter-branch manual page for more details; to squelch
> +         this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.

The "and pause" threw me. There's more than a bit of ambiguity
surrounding it. Perhaps drop it?
Elijah Newren Aug. 28, 2019, 9:48 p.m. UTC | #2
On Tue, Aug 27, 2019 at 11:17 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Tue, Aug 27, 2019 at 8:22 PM Elijah Newren <newren@gmail.com> wrote:
> > filter-branch suffers from a deluge of disguised dangers that disfigure
> > history rewrites (i.e. deviate from the deliberate changes). [...]
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> > diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
> > @@ -16,6 +16,20 @@ SYNOPSIS
> > +WARNING
> > +-------
> > +'git filter-branch' has a plethora of pitfalls that can produce non-obvious
> > +manglings of the intended history rewrite (and can leave you with little
> > +time to investigate such problems since it has such abysmal performance).
> > +These safety and performance issues cannot be backward compatibly fixed and
> > +as such, its use is not recommended.  Please use an alternative history
> > +filtering tool such as https://github.com/newren/git-filter-repo/[git
> > +filter-repo].  If you still need to use 'git filter-branch', please
> > +carefully read the "Safety" section of the message on the Git mailing list
> > +https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
> > +the land mines of filter-branch] and vigilantly avoid as many of the
> > +hazards listed there as reasonably possible.
>
> Is there a good reason to not simply copy the "Safety" section from
> that email directly into this document so that readers don't have to
> go chasing down the link (especially those who are reading
> documentation offline)?

Makes sense, I can include it.  However, saying e.g. "the
git-filter-branch manpage is missing..." or "the git-filter-branch
manpage actually documents <crazy buggy behavior> as expected" feels
really weird to include on the git-filter-branch manpage.  I'll try to
touch it up.

> > diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
> > @@ -832,7 +832,7 @@ Hard case: The changes are not the same.::
> >         This happens if the 'subsystem' rebase had conflicts, or used
> >         `--interactive` to omit, edit, squash, or fixup commits; or
> >         if the upstream used one of `commit --amend`, `reset`, or
> > -       `filter-branch`.
> > +       a full history rewriting command like `filter-repo`.
>
> Do we want a clickable link to `filter-repo` here?

I guess it can't hurt.

> > diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
> > @@ -123,10 +123,10 @@ The following format are available:
> > +linkgit:git-hash-object[1], linkgit:git-rebase[1], and
> > +linkgit:git-filter-repo[1], among other git commands, can be used to
> > [...]
> > @@ -148,8 +148,8 @@ pending objects.
> >  linkgit:git-hash-object[1]
> >  linkgit:git-rebase[1]
> > +linkgit:git-filter-repo[1]
>
> Are these 'linkgit:' references to `filter-repo` going to be
> meaningful if that tool is not incorporated into the Git project
> proper? Perhaps use a generic clickable link instead.
>
> Same comment applies to other 'linkgit:' invocations in the remainder
> of the patch.

I'm fixing them up.

> > diff --git a/git-filter-branch.sh b/git-filter-branch.sh
> > old mode 100755
> > new mode 100644
>
> Why lose the executable bit?

Whoops.  Did some rebasing and fixups, then continued editing my
buffer of the file after one of the rebases, realized the file was
deleted (because of the final patch in the series), moved the file out
of the way and rebased again and copied the file back into place, and
forgot to check the filemode.

> > @@ -83,6 +83,19 @@ set_ident () {
> > +if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
> > +     -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
>
> If this script didn't already have a mix of styles, I'd say something
> about modern style being:
>
>     if test -z "$FILTER_BRANCH_SQUELCH_WARNING" &&
>         test -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
>     then
>         ...
>     fi
>
> > +       cat <<EOF
> > +WARNING: git-filter-branch has a glut of gotchas generating mangled history
> > +         rewrites.  Please use an alternative filtering tool such as 'git
> > +         filter-repo' (https://github.com/newren/git-filter-repo/) instead.
> > +         See the filter-branch manual page for more details; to squelch
> > +         this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.
>
> The "and pause" threw me. There's more than a bit of ambiguity
> surrounding it. Perhaps drop it?

Sure, will do.

Patch
diff mbox series

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index cc940eb9ad..784e934009 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -17,9 +17,9 @@  This program dumps the given revisions in a form suitable to be piped
 into 'git fast-import'.
 
 You can use it as a human-readable bundle replacement (see
-linkgit:git-bundle[1]), or as a kind of an interactive
-'git filter-branch'.
-
+linkgit:git-bundle[1]), or as a format that can be edited before being
+fed to 'git fast-import' in order to do history rewrites (an ability
+relied on by tools like 'git filter-repo').
 
 OPTIONS
 -------
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 6b53dd7e06..e4047d472e 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -16,6 +16,20 @@  SYNOPSIS
 	[--original <namespace>] [-d <directory>] [-f | --force]
 	[--state-branch <branch>] [--] [<rev-list options>...]
 
+WARNING
+-------
+'git filter-branch' has a plethora of pitfalls that can produce non-obvious
+manglings of the intended history rewrite (and can leave you with little
+time to investigate such problems since it has such abysmal performance).
+These safety and performance issues cannot be backward compatibly fixed and
+as such, its use is not recommended.  Please use an alternative history
+filtering tool such as https://github.com/newren/git-filter-repo/[git
+filter-repo].  If you still need to use 'git filter-branch', please
+carefully read the "Safety" section of the message on the Git mailing list
+https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
+the land mines of filter-branch] and vigilantly avoid as many of the
+hazards listed there as reasonably possible.
+
 DESCRIPTION
 -----------
 Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,37 +459,6 @@  warned.
   (or if your git-gc is not new enough to support arguments to
   `--prune`, use `git repack -ad; git prune` instead).
 
-NOTES
------
-
-git-filter-branch allows you to make complex shell-scripted rewrites
-of your Git history, but you probably don't need this flexibility if
-you're simply _removing unwanted data_ like large files or passwords.
-For those operations you may want to consider
-http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
-a JVM-based alternative to git-filter-branch, typically at least
-10-50x faster for those use-cases, and with quite different
-characteristics:
-
-* Any particular version of a file is cleaned exactly _once_. The BFG,
-  unlike git-filter-branch, does not give you the opportunity to
-  handle a file differently based on where or when it was committed
-  within your history. This constraint gives the core performance
-  benefit of The BFG, and is well-suited to the task of cleansing bad
-  data - you don't care _where_ the bad data is, you just want it
-  _gone_.
-
-* By default The BFG takes full advantage of multi-core machines,
-  cleansing commit file-trees in parallel. git-filter-branch cleans
-  commits sequentially (i.e. in a single-threaded manner), though it
-  _is_ possible to write filters that include their own parallelism,
-  in the scripts executed against each commit.
-
-* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
-  are much more restrictive than git-filter branch, and dedicated just
-  to the tasks of removing unwanted data- e.g:
-  `--strip-blobs-bigger-than 1M`.
-
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 247f765604..0c114ad1ca 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -115,15 +115,14 @@  NOTES
 -----
 
 'git gc' tries very hard not to delete objects that are referenced
-anywhere in your repository. In
-particular, it will keep not only objects referenced by your current set
-of branches and tags, but also objects referenced by the index,
-remote-tracking branches, refs saved by 'git filter-branch' in
-refs/original/, reflogs (which may reference commits in branches
-that were later amended or rewound), and anything else in the refs/* namespace.
-If you are expecting some objects to be deleted and they aren't, check
-all of those locations and decide whether it makes sense in your case to
-remove those references.
+anywhere in your repository. In particular, it will keep not only
+objects referenced by your current set of branches and tags, but also
+objects referenced by the index, remote-tracking branches, notes saved
+by 'git notes' under refs/notes/, reflogs (which may reference commits
+in branches that were later amended or rewound), and anything else in
+the refs/* namespace.  If you are expecting some objects to be deleted
+and they aren't, check all of those locations and decide whether it
+makes sense in your case to remove those references.
 
 On the other hand, when 'git gc' runs concurrently with another process,
 there is a risk of it deleting an object that the other process is using
diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 6156609cf7..2f201d85d4 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -832,7 +832,7 @@  Hard case: The changes are not the same.::
 	This happens if the 'subsystem' rebase had conflicts, or used
 	`--interactive` to omit, edit, squash, or fixup commits; or
 	if the upstream used one of `commit --amend`, `reset`, or
-	`filter-branch`.
+	a full history rewriting command like `filter-repo`.
 
 
 The easy case
diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
index 246dc9943c..35595a2cd3 100644
--- a/Documentation/git-replace.txt
+++ b/Documentation/git-replace.txt
@@ -123,10 +123,10 @@  The following format are available:
 CREATING REPLACEMENT OBJECTS
 ----------------------------
 
-linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
-linkgit:git-rebase[1], among other git commands, can be used to create
-replacement objects from existing objects. The `--edit` option can
-also be used with 'git replace' to create a replacement object by
+linkgit:git-hash-object[1], linkgit:git-rebase[1], and
+linkgit:git-filter-repo[1], among other git commands, can be used to
+create replacement objects from existing objects. The `--edit` option
+can also be used with 'git replace' to create a replacement object by
 editing an existing object.
 
 If you want to replace many blobs, trees or commits that are part of a
@@ -148,8 +148,8 @@  pending objects.
 SEE ALSO
 --------
 linkgit:git-hash-object[1]
-linkgit:git-filter-branch[1]
 linkgit:git-rebase[1]
+linkgit:git-filter-repo[1]
 linkgit:git-tag[1]
 linkgit:git-branch[1]
 linkgit:git-commit[1]
diff --git a/Documentation/git-svn.txt b/Documentation/git-svn.txt
index 30711625fd..f2762dd5d4 100644
--- a/Documentation/git-svn.txt
+++ b/Documentation/git-svn.txt
@@ -769,9 +769,9 @@  option for (hopefully) obvious reasons.
 +
 This option is NOT recommended as it makes it difficult to track down
 old references to SVN revision numbers in existing documentation, bug
-reports and archives.  If you plan to eventually migrate from SVN to Git
+reports, and archives.  If you plan to eventually migrate from SVN to Git
 and are certain about dropping SVN history, consider
-linkgit:git-filter-branch[1] instead.  filter-branch also allows
+linkgit:git-filter-repo[1] instead.  filter-repo also allows
 reformatting of metadata for ease-of-reading and rewriting authorship
 info for non-"svn.authorsFile" users.
 
diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 82cd573776..997548f5ed 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -425,9 +425,10 @@  post-rewrite
 
 This hook is invoked by commands that rewrite commits
 (linkgit:git-commit[1] when called with `--amend` and
-linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
-it!).  Its first argument denotes the command it was invoked by:
-currently one of `amend` or `rebase`.  Further command-dependent
+linkgit:git-rebase[1]; however, full-history (re)writing tools like
+linkgit:git-fast-import[1] or linkgit:git-filter-repo[1] typically do
+not call it!).  Its first argument denotes the command it was invoked
+by: currently one of `amend` or `rebase`.  Further command-dependent
 arguments may be passed in the future.
 
 The hook receives a list of the rewritten commits on stdin, in the
diff --git a/contrib/svn-fe/svn-fe.txt b/contrib/svn-fe/svn-fe.txt
index a3425f4770..19333fc8df 100644
--- a/contrib/svn-fe/svn-fe.txt
+++ b/contrib/svn-fe/svn-fe.txt
@@ -56,7 +56,7 @@  line.  This line has the form `git-svn-id: URL@REVNO UUID`.
 
 The resulting repository will generally require further processing
 to put each project in its own repository and to separate the history
-of each branch.  The 'git filter-branch --subdirectory-filter' command
+of each branch.  The 'git filter-repo --subdirectory-filter' command
 may be useful for this purpose.
 
 BUGS
@@ -67,5 +67,5 @@  The exit status does not reflect whether an error was detected.
 
 SEE ALSO
 --------
-git-svn(1), svn2git(1), svk(1), git-filter-branch(1), git-fast-import(1),
+git-svn(1), svn2git(1), svk(1), git-filter-repo(1), git-fast-import(1),
 https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
old mode 100755
new mode 100644
index 5c5afa2b98..7b1865c1d5
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -83,6 +83,19 @@  set_ident () {
 	finish_ident COMMITTER
 }
 
+if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
+     -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
+	cat <<EOF
+WARNING: git-filter-branch has a glut of gotchas generating mangled history
+         rewrites.  Please use an alternative filtering tool such as 'git
+         filter-repo' (https://github.com/newren/git-filter-repo/) instead.
+         See the filter-branch manual page for more details; to squelch
+         this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.
+
+EOF
+	sleep 5
+fi
+
 USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
 	[--tree-filter <command>] [--index-filter <command>]
 	[--parent-filter <command>] [--msg-filter <command>]