diff mbox series

[8/9] fast-export: respect the possibly-overridden default branch name

Message ID 1efe848f2b029e572cea61cadcfe36b9d3797836.1591823971.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Allow overriding the default name of the default branch | expand

Commit Message

Philip Peterson via GitGitGadget June 10, 2020, 9:19 p.m. UTC
From: Johannes Schindelin <johannes.schindelin@gmx.de>

When anonymizing commit history, we are careful to leave the branch name
of the default branch alone.

When the default branch name is overridden via the config or via the
environment variable, we will want `git fast-export` to use that
overridden name instead.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 builtin/fast-export.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Matt Rogers June 10, 2020, 9:54 p.m. UTC | #1
> -        * We also leave "master" as a special case, since it does not reveal
> -        * anything interesting.
> +        * We also leave the default branch name as a special case, since it
> +        * does not reveal anything interesting.
>          */
I feel this is a weird thing to do, since you're trying to anonymize the branch
name,and now the default branch is identifiable with your config file.  For
example, if the default branch contains the name of my project/repo then this
sounds like a recipe for accidentally sharing it. I feel a better
alternative would
be to exclude nothing from the anonymization or the proposed default default
branch name


> -       if (!strcmp(refname, "refs/heads/master"))
> +       if (!default_branch_name)
> +               default_branch_name = git_default_branch_name(0);
> +
> +       if (!strcmp(refname, default_branch_name))
>                 return refname;
>
>         strbuf_reset(&anon);
> --
> gitgitgadget
>
Junio C Hamano June 10, 2020, 11:25 p.m. UTC | #2
Matt Rogers <mattr94@gmail.com> writes:

>> -        * We also leave "master" as a special case, since it does not reveal
>> -        * anything interesting.
>> +        * We also leave the default branch name as a special case, since it
>> +        * does not reveal anything interesting.
>>          */
> I feel this is a weird thing to do, since you're trying to anonymize the branch
> name,and now the default branch is identifiable with your config file.  For
> example, if the default branch contains the name of my project/repo then this
> sounds like a recipe for accidentally sharing it. I feel a better
> alternative would
> be to exclude nothing from the anonymization or the proposed default default
> branch name

I wonder if anything bad happens if we keep *no* refs intact in this
function.  "Since it does not reveal anything interesting" is an
excuse why not munging it may be OK, but it does not explain why we
prefer not munging it actively.

If there is no reason to keep _some_ refs as-is, I agree that it is
perfectly sensible not to have this special case at all.

Thanks.

>> -       if (!strcmp(refname, "refs/heads/master"))
>> +       if (!default_branch_name)
>> +               default_branch_name = git_default_branch_name(0);
>> +
>> +       if (!strcmp(refname, default_branch_name))
>>                 return refname;
>>
>>         strbuf_reset(&anon);
>> --
>> gitgitgadget
>>
brian m. carlson June 10, 2020, 11:39 p.m. UTC | #3
On 2020-06-10 at 21:54:01, Matt Rogers wrote:
> > -        * We also leave "master" as a special case, since it does not reveal
> > -        * anything interesting.
> > +        * We also leave the default branch name as a special case, since it
> > +        * does not reveal anything interesting.
> >          */
> I feel this is a weird thing to do, since you're trying to anonymize the branch
> name,and now the default branch is identifiable with your config file.  For
> example, if the default branch contains the name of my project/repo then this
> sounds like a recipe for accidentally sharing it. I feel a better
> alternative would
> be to exclude nothing from the anonymization or the proposed default default
> branch name

I think this is fine because it only reveals the name of your particular
choice of default branch.  The goal of the --anonymize option is to
allow people to maintain the structure of their repositories while
stripping private information from them, primarily for debugging
purposes (e.g., providing to us for troubleshooting).

The things people want to prevent exposing are their code, data, project
names, user names, etc.: that is, anything identifying, privileged, or
private.  The default branch name isn't any of those things; we know you
have one, and for troubleshooting purposes, we aren't that interested in
what you called it.  You've almost certainly picked it out of a set of
one of 20 words that people use for this purpose, none of which are
private, and all of which are shared by millions of other repositories.

In the extremely unlikely case that it does matter, invoking git with
something like "-c default.branch=$(git hash-object /dev/null)" would be
sufficient to anonymize all branches.

I should point out that people frequently ask for the output of "git
config -l" for troubleshooting, and most people wouldn't consider their
default branch name to be worth sanitizing there.
Matt Rogers June 11, 2020, 12:20 a.m. UTC | #4
> I think this is fine because it only reveals the name of your particular
> choice of default branch.  The goal of the --anonymize option is to
> allow people to maintain the structure of their repositories while
> stripping private information from them, primarily for debugging
> purposes (e.g., providing to us for troubleshooting).
>
> The things people want to prevent exposing are their code, data, project
> names, user names, etc.: that is, anything identifying, privileged, or
> private.  The default branch name isn't any of those things; we know you
> have one, and for troubleshooting purposes, we aren't that interested in
> what you called it.  You've almost certainly picked it out of a set of
> one of 20 words that people use for this purpose, none of which are
> private, and all of which are shared by millions of other repositories.
>

I think that's not very convincing.  If branch names in general are identifying
enough to warrant anonymization then shouldn't the default name be too?

> In the extremely unlikely case that it does matter, invoking git with
> something like "-c default.branch=$(git hash-object /dev/null)" would be
> sufficient to anonymize all branches.
>
> I should point out that people frequently ask for the output of "git
> config -l" for troubleshooting, and most people wouldn't consider their
> default branch name to be worth sanitizing there.

I think this is a little presumptuous, most people wouldn't consider it to be
worth sanitizing because there isn't currently such a config setting.  If I give
you the the output of "git config -l" then I think it's obvious that all of my
configuration settings will be included (and therefore I can choose to sanitize
accordingly), but if I'm giving an exported repository I think should be
anonymized, but my default branch, which someone could innocently base on a
project or company name, could easily be accidentally included in that output
which could lead to a frustrating experience


> --
> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204
Junio C Hamano June 11, 2020, 5:26 a.m. UTC | #5
Matt Rogers <mattr94@gmail.com> writes:

> I think that's not very convincing.  If branch names in general are identifying
> enough to warrant anonymization then shouldn't the default name be too?

It is a good argument.  I also heard a rumor that often branch names
contain codewords given to pre-released hardware that are highly
confidential in certain circles, and heard that it is one of the
reasons why Gerrit has server side ACL that lets you hide some
branches from authenticated users that can access other branches.

Again, the original comment explains why 'master' without such a
configuration knob was not worth protecting, but what it does not
explain is why keeping it (and only that branch name) unmunged gives
a more useful result than munging everything.  From the point of
view of "I want to debug the shape of the DAG, without the actual
user data", munging 'master' to 'ref47' while other branches like
'next' are munged to 'ref%d' does not make it harder to use or less
useful for the debugging than only 'master' is kept intact in the
output stream.
Johannes Schindelin June 11, 2020, 1:57 p.m. UTC | #6
Hi Matt,

On Wed, 10 Jun 2020, Matt Rogers wrote:

> > -        * We also leave "master" as a special case, since it does not reveal
> > -        * anything interesting.
> > +        * We also leave the default branch name as a special case, since it
> > +        * does not reveal anything interesting.
> >          */
> I feel this is a weird thing to do, since you're trying to anonymize the branch
> name,and now the default branch is identifiable with your config file.  For
> example, if the default branch contains the name of my project/repo then this
> sounds like a recipe for accidentally sharing it. I feel a better
> alternative would
> be to exclude nothing from the anonymization or the proposed default default
> branch name

I don't think that the name of the main branch should be subject to
anonymizing, whether it be `master` or anything else.

Ciao,
Dscho
Johannes Schindelin June 11, 2020, 2:05 p.m. UTC | #7
Hi,

On Wed, 10 Jun 2020, Junio C Hamano wrote:

> Matt Rogers <mattr94@gmail.com> writes:
>
> > I think that's not very convincing.  If branch names in general are identifying
> > enough to warrant anonymization then shouldn't the default name be too?
>
> It is a good argument.  I also heard a rumor that often branch names
> contain codewords given to pre-released hardware that are highly
> confidential in certain circles, and heard that it is one of the
> reasons why Gerrit has server side ACL that lets you hide some
> branches from authenticated users that can access other branches.

Yes, branch names in general _can_ contain information users may prefer to
keep private.

However, we're not talking about branch names in general. We are talking
about the default name of the main branch, to be picked in _all_ of your
new repositories.

> Again, the original comment explains why 'master' without such a
> configuration knob was not worth protecting, but what it does not
> explain is why keeping it (and only that branch name) unmunged gives
> a more useful result than munging everything.  From the point of
> view of "I want to debug the shape of the DAG, without the actual
> user data", munging 'master' to 'ref47' while other branches like
> 'next' are munged to 'ref%d' does not make it harder to use or less
> useful for the debugging than only 'master' is kept intact in the
> output stream.

Yes. And you're unlikely to configure the default name to be used for all
of your future `git init` operations to be something non-generic.

I am still highly doubtful of Matt's suggestion that it would be worth
protecting the default name of the main branch to be used for _each_ and
_any_ new repository.

Now, if you suggest that `git fast-export --anonymize` should either not
special-case the main branch, or at least have a configurable set of names
it skips from protecting, then I will be much more in favor of those
suggestions. However, those suggestions are quite a bit orthogonal to the
patch series at hand, so I would want to discuss them in their own code
contribution instead of here.

Ciao,
Dscho
Junio C Hamano June 11, 2020, 6:19 p.m. UTC | #8
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi Matt,
>
> On Wed, 10 Jun 2020, Matt Rogers wrote:
>
>> > -        * We also leave "master" as a special case, since it does not reveal
>> > -        * anything interesting.
>> > +        * We also leave the default branch name as a special case, since it
>> > +        * does not reveal anything interesting.
>> >          */
>> I feel this is a weird thing to do, since you're trying to anonymize the branch
>> name,and now the default branch is identifiable with your config file.  For
>> example, if the default branch contains the name of my project/repo then this
>> sounds like a recipe for accidentally sharing it. I feel a better
>> alternative would
>> be to exclude nothing from the anonymization or the proposed default default
>> branch name
>
> I don't think that the name of the main branch should be subject to
> anonymizing, whether it be `master` or anything else.

"Here is why" is missing ;-)  I think you realized that it needs to
be, after you wrote the "ah, we need two, the default for new ones
and the name of the primary branch in a particular repository", as
we are dealing with the latter here.

Thanks.
Johannes Schindelin June 12, 2020, 12:07 p.m. UTC | #9
Hi Junio,

On Thu, 11 Jun 2020, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > Hi Matt,
> >
> > On Wed, 10 Jun 2020, Matt Rogers wrote:
> >
> >> > -        * We also leave "master" as a special case, since it does not reveal
> >> > -        * anything interesting.
> >> > +        * We also leave the default branch name as a special case, since it
> >> > +        * does not reveal anything interesting.
> >> >          */
> >> I feel this is a weird thing to do, since you're trying to anonymize the branch
> >> name,and now the default branch is identifiable with your config file.  For
> >> example, if the default branch contains the name of my project/repo then this
> >> sounds like a recipe for accidentally sharing it. I feel a better
> >> alternative would
> >> be to exclude nothing from the anonymization or the proposed default default
> >> branch name
> >
> > I don't think that the name of the main branch should be subject to
> > anonymizing, whether it be `master` or anything else.
>
> "Here is why" is missing ;-)  I think you realized that it needs to
> be, after you wrote the "ah, we need two, the default for new ones
> and the name of the primary branch in a particular repository", as
> we are dealing with the latter here.

Yep, absolutely.

Thanks,
Dscho
diff mbox series

Patch

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 85868162eec..028dd9969a2 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -515,13 +515,17 @@  static const char *anonymize_refname(const char *refname)
 	};
 	static struct hashmap refs;
 	static struct strbuf anon = STRBUF_INIT;
+	static char *default_branch_name;
 	int i;
 
 	/*
-	 * We also leave "master" as a special case, since it does not reveal
-	 * anything interesting.
+	 * We also leave the default branch name as a special case, since it
+	 * does not reveal anything interesting.
 	 */
-	if (!strcmp(refname, "refs/heads/master"))
+	if (!default_branch_name)
+		default_branch_name = git_default_branch_name(0);
+
+	if (!strcmp(refname, default_branch_name))
 		return refname;
 
 	strbuf_reset(&anon);