From patchwork Mon Jun 22 21:47:59 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11619187 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8CBCE138C for ; Mon, 22 Jun 2020 21:48:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 79EDE20767 for ; Mon, 22 Jun 2020 21:48:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730686AbgFVVsB (ORCPT ); Mon, 22 Jun 2020 17:48:01 -0400 Received: from cloud.peff.net ([104.130.231.41]:39336 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727049AbgFVVsB (ORCPT ); Mon, 22 Jun 2020 17:48:01 -0400 Received: (qmail 1917 invoked by uid 109); 22 Jun 2020 21:48:00 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Mon, 22 Jun 2020 21:48:00 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 8573 invoked by uid 111); 22 Jun 2020 21:48:00 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Mon, 22 Jun 2020 17:48:00 -0400 Authentication-Results: peff.net; auth=none Date: Mon, 22 Jun 2020 17:47:59 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin Subject: [PATCH v2 1/4] fast-export: allow dumping the refname mapping Message-ID: <20200622214759.GA3303964@coredump.intra.peff.net> References: <20200622214745.GA3302779@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200622214745.GA3302779@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org After you 
anonymize a repository, it can be hard to find which commits correspond between the original and the result, and thus hard to reproduce commands that triggered bugs in the original. Let's make it possible to dump the mapping separately from the output stream. This can be used by a bug reporter to modify their reproduction recipe without revealing the original names (see the example in the documentation). The implementation is slightly non-obvious. There's no point in the program where we know the complete set of refs we're going to anonymize. Nor do we have a complete set of anonymized refs after finishing (we have a set of anonymized ref path components, but no knowledge of how those are assembled into complete refs). So we lazily write to the dump file as we anonymize each name, and keep a list of ones that we've output in order to avoid duplicates. Some possible alternatives: - we could just output the mapping of anonymized components (e.g., that "foo" became "ref123"). That works OK when you have short refnames (e.g., "refs/heads/foo" becomes "refs/heads/ref123"), but longer names would require the user to look up each component to assemble the result. For example, "refs/remotes/origin/jk/foo" might become "refs/remotes/ref37/ref56/ref102". - instead of dumping the mapping, the same problem could be solved by allowing the user to leave some refs alone. So if you want to reproduce "git rev-list branch~17..HEAD" in the anonymized repo, we could allow something like: git tag anon-one branch git tag anon-two HEAD git fast-export --anonymize --all \ --no-anonymize-ref=anon-one \ --no-anonymize-ref=anon-two \ >stream and then presumably "git rev-list anon-one~17..anon-two" would behave the same in the re-imported repository. This is more convenient in some ways, but it does require modifying the original repository. And the concept doesn't easily extend to other fields (e.g., pathnames, which will be addressed in a subsequent patch). 
- we could dump before/after commit hashes; combined with rev-parse, that could convert these cases (as well as ones using raw hashes). But we don't actually know the anonymized commit hashes; we're just generating a stream that will produce them in the anonymized repo. - likewise, we probably could insert object names or other markers into commit messages, blob contents, etc, in order to let a user with the original repo figure out which parts correspond. But using this gets complicated (I have to find my commits in the result with "git log --all --grep" or similar). It also makes it less clear that the anonymized repo didn't leak any information (because we are relying on object ids being unguessable). Signed-off-by: Jeff King --- Documentation/git-fast-export.txt | 22 ++++++++++++++++++++ builtin/fast-export.c | 34 +++++++++++++++++++++++++++++++ t/t9351-fast-export-anonymize.sh | 14 +++++++++++++ 3 files changed, 70 insertions(+) diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt index e8950de3ba..e809bb3f18 100644 --- a/Documentation/git-fast-export.txt +++ b/Documentation/git-fast-export.txt @@ -119,6 +119,12 @@ by keeping the marks the same across runs. the shape of the history and stored tree. See the section on `ANONYMIZING` below. +--dump-anonymized-refnames=<file>:: + Output the mapping of real refnames to anonymized refnames to + <file>. The output will contain one line per ref that appears in + the output stream, with the original refname, a space, and its + anonymized counterpart. See the section on `ANONYMIZING` below. + --reference-excluded-parents:: By default, running a command such as `git fast-export master~5..master` will not include the commit master{tilde}5 @@ -238,6 +244,22 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much smaller output, and it is usually easy to quickly confirm that there is no private data in the stream. 
+Reproducing some bugs may require referencing particular commits, which +becomes challenging after the refnames have all been anonymized. You can +use `--dump-anonymized-refnames` to output the mapping, and then alter +your reproduction recipe to use the anonymized names. E.g., if you find +a bug with `git rev-list v1.0..v2.0` in the private repository, you can +run: + +--------------------------------------------------- +$ git fast-export --anonymize --all --dump-anonymized-refnames=refs.out >stream +$ grep '^refs/tags/v[12].0' refs.out +refs/tags/v1.0 refs/tags/ref31 +refs/tags/v2.0 refs/tags/ref50 +--------------------------------------------------- + +which tells you that `git rev-list ref31..ref50` may produce the same +bug in the re-imported anonymous repository. LIMITATIONS ----------- diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 85868162ee..844726d45a 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -24,6 +24,7 @@ #include "remote.h" #include "blob.h" #include "commit-slab.h" +#include "khash.h" static const char *fast_export_usage[] = { N_("git fast-export [rev-list-opts]"), @@ -45,6 +46,7 @@ static struct string_list extra_refs = STRING_LIST_INIT_NODUP; static struct string_list tag_refs = STRING_LIST_INIT_NODUP; static struct refspec refspecs = REFSPEC_INIT_FETCH; static int anonymize; +static FILE *anonymized_refnames_handle; static struct revision_sources revision_sources; static int parse_opt_signed_tag_mode(const struct option *opt, @@ -118,6 +120,23 @@ static int has_unshown_parent(struct commit *commit) return 0; } +KHASH_INIT(strset, const char *, int, 0, kh_str_hash_func, kh_str_hash_equal); + +struct seen_set { + kh_strset_t *set; +}; + +static int check_and_mark_seen(struct seen_set *seen, const char *str) +{ + int hashret; + if (!seen->set) + seen->set = kh_init_strset(); + if (kh_get_strset(seen->set, str) < kh_end(seen->set)) + return 1; + kh_put_strset(seen->set, xstrdup(str), &hashret); + return 0; +} 
+ struct anonymized_entry { struct hashmap_entry hash; const char *orig; @@ -515,6 +534,8 @@ static const char *anonymize_refname(const char *refname) }; static struct hashmap refs; static struct strbuf anon = STRBUF_INIT; + static struct seen_set seen; + const char *full_refname = refname; int i; /* @@ -533,6 +554,12 @@ static const char *anonymize_refname(const char *refname) } anonymize_path(&anon, refname, &refs, anonymize_ref_component); + + if (anonymized_refnames_handle && + !check_and_mark_seen(&seen, full_refname)) + fprintf(anonymized_refnames_handle, "%s %s\n", + full_refname, anon.buf); + return anon.buf; } @@ -1144,6 +1171,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) char *export_filename = NULL, *import_filename = NULL, *import_filename_if_exists = NULL; + const char *anonymized_refnames_file = NULL; uint32_t lastimportid; struct string_list refspecs_list = STRING_LIST_INIT_NODUP; struct string_list paths_of_changed_objects = STRING_LIST_INIT_DUP; @@ -1177,6 +1205,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"), N_("Apply refspec to exported refs")), OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")), + OPT_STRING(0, "dump-anonymized-refnames", + &anonymized_refnames_file, N_("file"), + N_("output anonymized refname mapping to <file>")), OPT_BOOL(0, "reference-excluded-parents", &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")), OPT_BOOL(0, "show-original-ids", &show_original_ids, @@ -1213,6 +1244,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) string_list_clear(&refspecs_list, 1); } + if (anonymized_refnames_file) + anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w"); + if (use_done_feature) printf("feature done\n"); diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index 897dc50907..75cbc7b329 100755 --- 
a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -46,6 +46,20 @@ test_expect_success 'stream omits tag message' ' ! grep "annotated tag" stream ' +test_expect_success 'refname mapping can be dumped' ' + git fast-export --anonymize --all \ + --dump-anonymized-refnames=refs.out >/dev/null && + # we make no guarantees of the exact anonymized names, + # so just check that we have the right number and + # that a sample line looks sane. + expected_count=$(git for-each-ref | wc -l) && + # Note that master is not anonymized, and so not included + # in the mapping. + expected_count=$((expected_count - 1)) && + test_line_count = $expected_count refs.out && + grep "^refs/heads/other refs/heads/" refs.out +' + # NOTE: we chdir to the new, anonymized repository # after this. All further tests should assume this. test_expect_success 'import stream to new repository' ' From patchwork Mon Jun 22 21:48:02 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11619189 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EDDDE13B1 for ; Mon, 22 Jun 2020 21:48:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E093F2075A for ; Mon, 22 Jun 2020 21:48:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730794AbgFVVsI (ORCPT ); Mon, 22 Jun 2020 17:48:08 -0400 Received: from cloud.peff.net ([104.130.231.41]:39342 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727049AbgFVVsG (ORCPT ); Mon, 22 Jun 2020 17:48:06 -0400 Received: (qmail 1929 invoked by uid 109); 22 Jun 2020 21:48:03 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Mon, 22 Jun 2020 21:48:03 +0000 
Authentication-Results: cloud.peff.net; auth=none Received: (qmail 8585 invoked by uid 111); 22 Jun 2020 21:48:03 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Mon, 22 Jun 2020 17:48:03 -0400 Authentication-Results: peff.net; auth=none Date: Mon, 22 Jun 2020 17:48:02 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin Subject: [PATCH v2 2/4] fast-export: anonymize "master" refname Message-ID: <20200622214802.GB3303964@coredump.intra.peff.net> References: <20200622214745.GA3302779@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200622214745.GA3302779@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Running "fast-export --anonymize" will leave "refs/heads/master" untouched in the output, for two reasons: - it helped to have some known reference point between the original and anonymized repository - since it's historically the default branch name, it doesn't leak any information Now that we can ask fast-export to dump the anonymized ref mapping, we have a much better tool for the first one (because it works for _any_ ref, not just master). For the second, the notion of "default branch name" is likely to become configurable soon, at which point the name _does_ leak information. Let's drop this special case in preparation. Note that we have to adjust the test a bit, since it relied on using the name "master" in the anonymized repos. But this gives us a good opportunity to further test the new dumping feature. 
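As a rough illustration of the test adjustment described above, the sketch below looks up anonymized branch names in a dumped mapping instead of assuming that "master" survives unchanged. The contents of refs.out here are invented sample data in the "original, space, anonymized" format; real output from --dump-anonymized-refnames will use different names.

```shell
# Hypothetical mapping file; the anonymized names are made up for
# illustration and are not real fast-export output.
cat >refs.out <<'EOF'
refs/heads/master refs/heads/ref0
refs/heads/other refs/heads/ref1
refs/tags/mytag refs/tags/ref2
EOF

# Drop the original name and the separating space, and print only the
# lines where the substitution matched (sed -n ... p).
main_branch=$(sed -ne "s,^refs/heads/master ,,p" refs.out)
other_branch=$(sed -ne "s,^refs/heads/other ,,p" refs.out)

echo "$main_branch"
echo "$other_branch"
```

The two variables can then stand in for the original names in a command such as "git log $main_branch...$other_branch" run inside the re-imported repository.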
Signed-off-by: Jeff King --- builtin/fast-export.c | 7 ------- t/t9351-fast-export-anonymize.sh | 15 +++++---------- 2 files changed, 5 insertions(+), 17 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 844726d45a..faaab6c7e9 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -538,13 +538,6 @@ static const char *anonymize_refname(const char *refname) const char *full_refname = refname; int i; - /* - * We also leave "master" as a special case, since it does not reveal - * anything interesting. - */ - if (!strcmp(refname, "refs/heads/master")) - return refname; - strbuf_reset(&anon); for (i = 0; i < ARRAY_SIZE(prefixes); i++) { if (skip_prefix(refname, prefixes[i], &refname)) { diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index 75cbc7b329..c726306c4d 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -26,11 +26,8 @@ test_expect_success 'stream omits path names' ' ! grep xyzzy stream ' -test_expect_success 'stream allows master as refname' ' - grep master stream -' - -test_expect_success 'stream omits other refnames' ' +test_expect_success 'stream omits refnames' ' + ! grep master stream && ! grep other stream && ! grep mytag stream ' @@ -53,9 +50,6 @@ test_expect_success 'refname mapping can be dumped' ' # so just check that we have the right number and # that a sample line looks sane. expected_count=$(git for-each-ref | wc -l) && - # Note that master is not anonymized, and so not included - # in the mapping. 
- expected_count=$((expected_count - 1)) && test_line_count = $expected_count refs.out && grep "^refs/heads/other refs/heads/" refs.out ' @@ -71,15 +65,16 @@ test_expect_success 'import stream to new repository' ' test_expect_success 'result has two branches' ' git for-each-ref --format="%(refname)" refs/heads >branches && test_line_count = 2 branches && - other_branch=$(grep -v refs/heads/master branches) + main_branch=$(sed -ne "s,refs/heads/master ,,p" ../refs.out) && + other_branch=$(sed -ne "s,refs/heads/other ,,p" ../refs.out) ' test_expect_success 'repo has original shape and timestamps' ' shape () { git log --format="%m %ct" --left-right --boundary "$@" } && (cd .. && shape master...other) >expect && - shape master...$other_branch >actual && + shape $main_branch...$other_branch >actual && test_cmp expect actual ' From patchwork Mon Jun 22 21:48:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11619191 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3D91D13B1 for ; Mon, 22 Jun 2020 21:48:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2CE852078E for ; Mon, 22 Jun 2020 21:48:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730770AbgFVVsI (ORCPT ); Mon, 22 Jun 2020 17:48:08 -0400 Received: from cloud.peff.net ([104.130.231.41]:39352 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730694AbgFVVsG (ORCPT ); Mon, 22 Jun 2020 17:48:06 -0400 Received: (qmail 1943 invoked by uid 109); 22 Jun 2020 21:48:06 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Mon, 22 Jun 2020 21:48:06 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 8591 
invoked by uid 111); 22 Jun 2020 21:48:06 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Mon, 22 Jun 2020 17:48:06 -0400 Authentication-Results: peff.net; auth=none Date: Mon, 22 Jun 2020 17:48:05 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin Subject: [PATCH v2 3/4] fast-export: refactor path printing to not rely on stdout Message-ID: <20200622214805.GC3303964@coredump.intra.peff.net> References: <20200622214745.GA3302779@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200622214745.GA3302779@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org We'll be using print_path_1() in more places in a subsequent patch, so let's teach it to take the output handle as a parameter. Signed-off-by: Jeff King --- builtin/fast-export.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index faaab6c7e9..aa7ac9761d 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -369,15 +369,15 @@ static int depth_first(const void *a_, const void *b_) return (a->status == 'R') - (b->status == 'R'); } -static void print_path_1(const char *path) +static void print_path_1(FILE *out, const char *path) { int need_quote = quote_c_style(path, NULL, NULL, 0); if (need_quote) - quote_c_style(path, NULL, stdout, 0); + quote_c_style(path, NULL, out, 0); else if (strchr(path, ' ')) - printf("\"%s\"", path); + fprintf(out, "\"%s\"", path); else - printf("%s", path); + fprintf(out, "%s", path); } static void *anonymize_path_component(const void *path, size_t *len) @@ -391,13 +391,13 @@ static void *anonymize_path_component(const void *path, size_t *len) static void print_path(const char *path) { if (!anonymize) - print_path_1(path); + 
print_path_1(stdout, path); else { static struct hashmap paths; static struct strbuf anon = STRBUF_INIT; anonymize_path(&anon, path, &paths, anonymize_path_component); - print_path_1(anon.buf); + print_path_1(stdout, anon.buf); strbuf_reset(&anon); } } From patchwork Mon Jun 22 21:48:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11619193 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 39E3F13B1 for ; Mon, 22 Jun 2020 21:48:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2C32420767 for ; Mon, 22 Jun 2020 21:48:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730805AbgFVVsN (ORCPT ); Mon, 22 Jun 2020 17:48:13 -0400 Received: from cloud.peff.net ([104.130.231.41]:39364 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730646AbgFVVsM (ORCPT ); Mon, 22 Jun 2020 17:48:12 -0400 Received: (qmail 1959 invoked by uid 109); 22 Jun 2020 21:48:10 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Mon, 22 Jun 2020 21:48:10 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 8611 invoked by uid 111); 22 Jun 2020 21:48:10 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Mon, 22 Jun 2020 17:48:10 -0400 Authentication-Results: peff.net; auth=none Date: Mon, 22 Jun 2020 17:48:09 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin Subject: [PATCH v2 4/4] fast-export: allow dumping the path mapping Message-ID: <20200622214809.GD3303964@coredump.intra.peff.net> References: 
<20200622214745.GA3302779@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200622214745.GA3302779@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org When working with an anonymized repo, it can be useful to be able to refer to particular paths. E.g., reproducing a bug with "git rev-list -- foo.c" in the original repo would need to replace "foo.c" with its anonymized counterpart to produce the same effect. We recently taught fast-export to dump the refname mapping. Let's do the same thing for paths, which can reuse most of the same infrastructure. We could also just introduce a "dump mapping" file that shows every mapping we make. But it would be a bit more awkward to work with, as the user would have to sort through more data to find the parts they're interested in (and there are likely to be many more paths than refnames, making it annoying for people who just want to dump the refnames). Signed-off-by: Jeff King --- Documentation/git-fast-export.txt | 12 ++++++++++++ builtin/fast-export.c | 16 ++++++++++++++++ t/t9351-fast-export-anonymize.sh | 21 +++++++++++++++++++-- 3 files changed, 47 insertions(+), 2 deletions(-) diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt index e809bb3f18..342e34fd89 100644 --- a/Documentation/git-fast-export.txt +++ b/Documentation/git-fast-export.txt @@ -125,6 +125,14 @@ by keeping the marks the same across runs. the output stream, with the original refname, a space, and its anonymized counterpart. See the section on `ANONYMIZING` below. +--dump-anonymized-paths=<file>:: + Output the mapping of real paths to anonymized paths to <file>. + The output will contain one line per path that appears in the + output stream, with the original path, a space, and its + anonymized counterpart. Paths may be quoted if they contain a + space, or unusual characters; see `core.quotePath` in + linkgit:git-config(1). 
See also `ANONYMIZING` below. + --reference-excluded-parents:: By default, running a command such as `git fast-export master~5..master` will not include the commit master{tilde}5 @@ -261,6 +269,10 @@ refs/tags/v2.0 refs/tags/ref50 which tells you that `git rev-list ref31..ref50` may produce the same bug in the re-imported anonymous repository. +Likewise, `--dump-anonymized-paths` may be useful for a bug that +involves pathspecs. E.g., `git rev-list v1.0..v2.0 -- foo.c` requires +knowing the path corresponding to `foo.c` in the result. + LIMITATIONS ----------- diff --git a/builtin/fast-export.c b/builtin/fast-export.c index aa7ac9761d..080ded92e4 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -47,6 +47,7 @@ static struct string_list tag_refs = STRING_LIST_INIT_NODUP; static struct refspec refspecs = REFSPEC_INIT_FETCH; static int anonymize; static FILE *anonymized_refnames_handle; +static FILE *anonymized_paths_handle; static struct revision_sources revision_sources; static int parse_opt_signed_tag_mode(const struct option *opt, @@ -394,9 +395,18 @@ static void print_path(const char *path) print_path_1(stdout, path); else { static struct hashmap paths; + static struct seen_set seen; static struct strbuf anon = STRBUF_INIT; anonymize_path(&anon, path, &paths, anonymize_path_component); + if (anonymized_paths_handle && + !check_and_mark_seen(&seen, path)) { + print_path_1(anonymized_paths_handle, path); + fputc(' ', anonymized_paths_handle); + print_path_1(anonymized_paths_handle, anon.buf); + fputc('\n', anonymized_paths_handle); + } + print_path_1(stdout, anon.buf); strbuf_reset(&anon); } @@ -1165,6 +1175,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) *import_filename = NULL, *import_filename_if_exists = NULL; const char *anonymized_refnames_file = NULL; + const char *anonymized_paths_file = NULL; uint32_t lastimportid; struct string_list refspecs_list = STRING_LIST_INIT_NODUP; struct string_list 
paths_of_changed_objects = STRING_LIST_INIT_DUP; @@ -1201,6 +1212,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) OPT_STRING(0, "dump-anonymized-refnames", &anonymized_refnames_file, N_("file"), N_("output anonymized refname mapping to <file>")), + OPT_STRING(0, "dump-anonymized-paths", + &anonymized_paths_file, N_("file"), + N_("output anonymized path mapping to <file>")), OPT_BOOL(0, "reference-excluded-parents", &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")), OPT_BOOL(0, "show-original-ids", &show_original_ids, @@ -1239,6 +1253,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) if (anonymized_refnames_file) anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w"); + if (anonymized_paths_file) + anonymized_paths_handle = xfopen(anonymized_paths_file, "w"); if (use_done_feature) printf("feature done\n"); diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index c726306c4d..ef72b1ec95 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -9,7 +9,7 @@ test_expect_success 'setup simple repo' ' git checkout -b other HEAD^ && mkdir subdir && test_commit subdir/bar && - test_commit subdir/xyzzy && + test_commit quoting "subdir/this needs quoting" && git tag -m "annotated tag" mytag ' @@ -23,7 +23,7 @@ test_expect_success 'stream omits path names' ' ! grep foo stream && ! grep subdir stream && ! grep bar stream && - ! grep xyzzy stream + ! grep quoting stream ' test_expect_success 'stream omits refnames' ' @@ -54,6 +54,23 @@ test_expect_success 'refname mapping can be dumped' ' grep "^refs/heads/other refs/heads/" refs.out ' +test_expect_success 'path mapping can be dumped' ' + git fast-export --anonymize --all \ + --dump-anonymized-paths=paths.out >/dev/null && + # as above, avoid depending on the exact scheme, + # but check that we have the right number of mappings, + # and spot-check one sample. 
+ expected_count=$( + git rev-list --objects --all | + git cat-file --batch-check="%(objecttype) %(rest)" | + sed -ne "s/^blob //p" | + sort -u | + wc -l + ) && + test_line_count = $expected_count paths.out && + grep "^\"subdir/this needs quoting\" " paths.out +' + # NOTE: we chdir to the new, anonymized repository # after this. All further tests should assume this. test_expect_success 'import stream to new repository' '