From patchwork Thu Jun 25 19:48:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626031 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5BCDF912 for ; Thu, 25 Jun 2020 19:48:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3E00F207D8 for ; Thu, 25 Jun 2020 19:48:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406822AbgFYTsQ (ORCPT ); Thu, 25 Jun 2020 15:48:16 -0400 Received: from cloud.peff.net ([104.130.231.41]:43244 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406573AbgFYTsP (ORCPT ); Thu, 25 Jun 2020 15:48:15 -0400 Received: (qmail 31325 invoked by uid 109); 25 Jun 2020 19:48:15 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:15 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19566 invoked by uid 111); 25 Jun 2020 19:48:16 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:16 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:14 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 01/11] t9351: derive anonymized tree checks from original repo Message-ID: <20200625194814.GA4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Our tests of the anonymized repo just hard-code the expected set of objects in the root and subdirectory trees. This makes them brittle to the test setup changing (e.g., adding new paths that need tested). Let's look at the original repo to compute our expected set of objects. Note that this isn't completely perfect (e.g., we still rely on there being only one tree in the root), but it does simplify later patches. Signed-off-by: Jeff King --- t/t9351-fast-export-anonymize.sh | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index 897dc50907..e772cf9930 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -71,22 +71,18 @@ test_expect_success 'repo has original shape and timestamps' ' test_expect_success 'root tree has original shape' ' # the output entries are not necessarily in the same - # order, but we know at least that we will have one tree - # and one blob, so just check the sorted order - cat >expect <<-\EOF && - blob - tree - EOF + # order, but we should at least have the same set of + # object types. + git -C .. ls-tree HEAD >orig-root && + cut -d" " -f2 expect && git ls-tree $other_branch >root && cut -d" " -f2 actual && test_cmp expect actual ' test_expect_success 'paths in subdir ended up in one tree' ' - cat >expect <<-\EOF && - blob - blob - EOF + git -C .. ls-tree other:subdir >orig-subdir && + cut -d" " -f2 expect && tree=$(grep tree root | cut -f2) && git ls-tree $other_branch:$tree >tree && cut -d" " -f2 actual && From patchwork Thu Jun 25 19:48:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626033 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BE4D1912 for ; Thu, 25 Jun 2020 19:48:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id ABD462082E for ; Thu, 25 Jun 2020 19:48:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406858AbgFYTsT (ORCPT ); Thu, 25 Jun 2020 15:48:19 -0400 Received: from cloud.peff.net ([104.130.231.41]:43260 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406836AbgFYTsS (ORCPT ); Thu, 25 Jun 2020 15:48:18 -0400 Received: (qmail 31345 invoked by uid 109); 25 Jun 2020 19:48:18 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:18 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19579 invoked by uid 111); 25 Jun 2020 19:48:18 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:18 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:17 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 02/11] fast-export: use xmemdupz() for anonymizing oids Message-ID: <20200625194817.GB4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Our anonymize_mem() function is careful to take a ptr/len pair to allow storing binary tokens like object ids, as well as partial strings (e.g., just "foo" of "foo/bar"). But it duplicates the hash key using xstrdup()! That means that: - for a partial string, we'd store all bytes up to the NUL, even though we'd never look at anything past "len". This didn't produce wrong behavior, but was wasteful. - for a binary oid that doesn't contain a zero byte, we'd copy garbage bytes off the end of the array (though as long as nothing complained about reading uninitialized bytes, further reads would be limited by "len", and we'd produce the correct results) - for a binary oid that does contain a zero byte, we'd copy _fewer_ bytes than intended into the hashmap struct. When we later try to look up a value, we'd access uninitialized memory and potentially falsely claim that a particular oid is not present. The most common reason to store an oid is an anonymized gitlink, but our test case doesn't have any gitlinks at all. So let's add one whose oid contains a NUL and is present at two different paths. ASan catches the memory error, but even without it we can detect the bug because the oid is not anonymized the same way for both paths. And of course the fix is to copy the correct number of bytes. We don't technically need the appended NUL from xmemdupz(), but it doesn't hurt as an extra protection against anybody treating it like a string (plus a future patch will push us more in that direction). Signed-off-by: Jeff King --- builtin/fast-export.c | 2 +- t/t9351-fast-export-anonymize.sh | 15 +++++++++++++++ 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 85868162ee..289395a131 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -162,7 +162,7 @@ static const void *anonymize_mem(struct hashmap *map, if (!ret) { ret = xmalloc(sizeof(*ret)); hashmap_entry_init(&ret->hash, key.hash.hash); - ret->orig = xstrdup(orig); + ret->orig = xmemdupz(orig, *len); ret->orig_len = *len; ret->anon = generate(orig, len); ret->anon_len = *len; diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index e772cf9930..dc5d75cd19 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -10,6 +10,10 @@ test_expect_success 'setup simple repo' ' mkdir subdir && test_commit subdir/bar && test_commit subdir/xyzzy && + fake_commit=$(echo $ZERO_OID | sed s/0/a/) && + git update-index --add --cacheinfo 160000,$fake_commit,link1 && + git update-index --add --cacheinfo 160000,$fake_commit,link2 && + git commit -m "add gitlink" && git tag -m "annotated tag" mytag ' @@ -26,6 +30,12 @@ test_expect_success 'stream omits path names' ' ! grep xyzzy stream ' +test_expect_success 'stream omits gitlink oids' ' + # avoid relying on the whole oid to remain hash-agnostic; this is + # plenty to be unique within our test case + ! grep a000000000000000000 stream +' + test_expect_success 'stream allows master as refname' ' grep master stream ' @@ -89,6 +99,11 @@ test_expect_success 'paths in subdir ended up in one tree' ' test_cmp expect actual ' +test_expect_success 'identical gitlinks got identical oid' ' + awk "/commit/ { print \$3 }" commits && + test_line_count = 1 commits +' + test_expect_success 'tag points to branch tip' ' git rev-parse $other_branch >expect && git for-each-ref --format="%(*objectname)" | grep . >actual && From patchwork Thu Jun 25 19:48:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626035 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5B03A912 for ; Thu, 25 Jun 2020 19:48:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4C709207D8 for ; Thu, 25 Jun 2020 19:48:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406862AbgFYTsV (ORCPT ); Thu, 25 Jun 2020 15:48:21 -0400 Received: from cloud.peff.net ([104.130.231.41]:43268 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406836AbgFYTsU (ORCPT ); Thu, 25 Jun 2020 15:48:20 -0400 Received: (qmail 31356 invoked by uid 109); 25 Jun 2020 19:48:20 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:20 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19586 invoked by uid 111); 25 Jun 2020 19:48:20 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:20 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:19 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 03/11] fast-export: store anonymized oids as hex strings Message-ID: <20200625194819.GC4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org When fast-export stores anonymized oids, it does so as binary strings. And while the anonymous mapping storage is binary-clean (at least as of the previous commit), this will become awkward when we start exposing more of it to the user. In particular, if we allow a method for retaining token "foo", then users may want to specify a hex oid as such a token. Let's just switch to storing the hex strings. The difference in memory usage is negligible (especially considering how infrequently we'd generally store an oid compared to, say, path components). Signed-off-by: Jeff King --- builtin/fast-export.c | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 289395a131..4a3a4c933e 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -387,16 +387,19 @@ static void *generate_fake_oid(const void *old, size_t *len) { static uint32_t counter = 1; /* avoid null oid */ const unsigned hashsz = the_hash_algo->rawsz; - unsigned char *out = xcalloc(hashsz, 1); - put_be32(out + hashsz - 4, counter++); - return out; + struct object_id oid; + char *hex = xmallocz(GIT_MAX_HEXSZ); + + oidclr(&oid); + put_be32(oid.hash + hashsz - 4, counter++); + return oid_to_hex_r(hex, &oid); } -static const struct object_id *anonymize_oid(const struct object_id *oid) +static const char *anonymize_oid(const char *oid_hex) { static struct hashmap objs; - size_t len = the_hash_algo->rawsz; - return anonymize_mem(&objs, generate_fake_oid, oid, &len); + size_t len = strlen(oid_hex); + return anonymize_mem(&objs, generate_fake_oid, oid_hex, &len); } static void show_filemodify(struct diff_queue_struct *q, @@ -455,9 +458,9 @@ static void show_filemodify(struct diff_queue_struct *q, */ if (no_data || S_ISGITLINK(spec->mode)) printf("M %06o %s ", spec->mode, - oid_to_hex(anonymize ? - anonymize_oid(&spec->oid) : - &spec->oid)); + anonymize ? + anonymize_oid(oid_to_hex(&spec->oid)) : + oid_to_hex(&spec->oid)); else { struct object *object = lookup_object(the_repository, &spec->oid); @@ -712,9 +715,10 @@ static void handle_commit(struct commit *commit, struct rev_info *rev, if (mark) printf(":%d\n", mark); else - printf("%s\n", oid_to_hex(anonymize ? - anonymize_oid(&obj->oid) : - &obj->oid)); + printf("%s\n", + anonymize ? + anonymize_oid(oid_to_hex(&obj->oid)) : + oid_to_hex(&obj->oid)); i++; } From patchwork Thu Jun 25 19:48:21 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626041 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 16FB492A for ; Thu, 25 Jun 2020 19:48:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 08E5E20781 for ; Thu, 25 Jun 2020 19:48:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406872AbgFYTs0 (ORCPT ); Thu, 25 Jun 2020 15:48:26 -0400 Received: from cloud.peff.net ([104.130.231.41]:43280 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406836AbgFYTsX (ORCPT ); Thu, 25 Jun 2020 15:48:23 -0400 Received: (qmail 31372 invoked by uid 109); 25 Jun 2020 19:48:22 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:22 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19595 invoked by uid 111); 25 Jun 2020 19:48:22 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:22 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:21 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 04/11] fast-export: tighten anonymize_mem() interface to handle only strings Message-ID: <20200625194821.GD4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org While the anonymize_mem() interface _can_ store arbitrary byte sequences, none of the callers uses this feature (as of the previous commit). We'd like to keep it that way, as we'll be exposing the string-like nature of the anonymization routines to the user. So let's tighten up the interface a bit: - don't treat "len" as an out-parameter from anonymize_mem(); this ensures callers treat the pointer result as a NUL-terminated string - likewise, don't treat "len" as an out-parameter from generator functions - swap out "void *" for "char *" as appropriate to signal that we don't handle arbitrary memory - rename the function to anonymize_str() This will also open up some optimization opportunities in a future patch. Note that we can't drop the "len" parameter entirely. Some callers do pass in partial strings (e.g., "foo/bar", len=3) to avoid copying, and we need to handle those still. Signed-off-by: Jeff King --- builtin/fast-export.c | 53 +++++++++++++++++++++---------------------- 1 file changed, 26 insertions(+), 27 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 4a3a4c933e..d8ea067630 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -145,31 +145,30 @@ static int anonymized_entry_cmp(const void *unused_cmp_data, * the same anonymized string with another. The actual generation * is farmed out to the generate function. */ -static const void *anonymize_mem(struct hashmap *map, - void *(*generate)(const void *, size_t *), - const void *orig, size_t *len) +static const char *anonymize_str(struct hashmap *map, + char *(*generate)(const char *, size_t), + const char *orig, size_t len) { struct anonymized_entry key, *ret; if (!map->cmpfn) hashmap_init(map, anonymized_entry_cmp, NULL, 0); - hashmap_entry_init(&key.hash, memhash(orig, *len)); + hashmap_entry_init(&key.hash, memhash(orig, len)); key.orig = orig; - key.orig_len = *len; + key.orig_len = len; ret = hashmap_get_entry(map, &key, hash, NULL); if (!ret) { ret = xmalloc(sizeof(*ret)); hashmap_entry_init(&ret->hash, key.hash.hash); - ret->orig = xmemdupz(orig, *len); - ret->orig_len = *len; + ret->orig = xmemdupz(orig, len); + ret->orig_len = len; ret->anon = generate(orig, len); - ret->anon_len = *len; + ret->anon_len = strlen(ret->anon); hashmap_put(map, &ret->hash); } - *len = ret->anon_len; return ret->anon; } @@ -181,13 +180,13 @@ static const void *anonymize_mem(struct hashmap *map, */ static void anonymize_path(struct strbuf *out, const char *path, struct hashmap *map, - void *(*generate)(const void *, size_t *)) + char *(*generate)(const char *, size_t)) { while (*path) { const char *end_of_component = strchrnul(path, '/'); size_t len = end_of_component - path; - const char *c = anonymize_mem(map, generate, path, &len); - strbuf_add(out, c, len); + const char *c = anonymize_str(map, generate, path, len); + strbuf_addstr(out, c); path = end_of_component; if (*path) strbuf_addch(out, *path++); @@ -361,12 +360,12 @@ static void print_path_1(const char *path) printf("%s", path); } -static void *anonymize_path_component(const void *path, size_t *len) +static char *anonymize_path_component(const char *path, size_t len) { static int counter; struct strbuf out = STRBUF_INIT; strbuf_addf(&out, "path%d", counter++); - return strbuf_detach(&out, len); + return strbuf_detach(&out, NULL); } static void print_path(const char *path) @@ -383,7 +382,7 @@ static void print_path(const char *path) } } -static void *generate_fake_oid(const void *old, size_t *len) +static char *generate_fake_oid(const char *old, size_t len) { static uint32_t counter = 1; /* avoid null oid */ const unsigned hashsz = the_hash_algo->rawsz; @@ -399,7 +398,7 @@ static const char *anonymize_oid(const char *oid_hex) { static struct hashmap objs; size_t len = strlen(oid_hex); - return anonymize_mem(&objs, generate_fake_oid, oid_hex, &len); + return anonymize_str(&objs, generate_fake_oid, oid_hex, len); } static void show_filemodify(struct diff_queue_struct *q, @@ -496,12 +495,12 @@ static const char *find_encoding(const char *begin, const char *end) return bol; } -static void *anonymize_ref_component(const void *old, size_t *len) +static char *anonymize_ref_component(const char *old, size_t len) { static int counter; struct strbuf out = STRBUF_INIT; strbuf_addf(&out, "ref%d", counter++); - return strbuf_detach(&out, len); + return strbuf_detach(&out, NULL); } static const char *anonymize_refname(const char *refname) @@ -550,13 +549,13 @@ static char *anonymize_commit_message(const char *old) } static struct hashmap idents; -static void *anonymize_ident(const void *old, size_t *len) +static char *anonymize_ident(const char *old, size_t len) { static int counter; struct strbuf out = STRBUF_INIT; strbuf_addf(&out, "User %d ", counter, counter); counter++; - return strbuf_detach(&out, len); + return strbuf_detach(&out, NULL); } /* @@ -591,9 +590,9 @@ static void anonymize_ident_line(const char **beg, const char **end) size_t len; len = split.mail_end - split.name_begin; - ident = anonymize_mem(&idents, anonymize_ident, - split.name_begin, &len); - strbuf_add(out, ident, len); + ident = anonymize_str(&idents, anonymize_ident, + split.name_begin, len); + strbuf_addstr(out, ident); strbuf_addch(out, ' '); strbuf_add(out, split.date_begin, split.tz_end - split.date_begin); } else { @@ -733,12 +732,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev, show_progress(); } -static void *anonymize_tag(const void *old, size_t *len) +static char *anonymize_tag(const char *old, size_t len) { static int counter; struct strbuf out = STRBUF_INIT; strbuf_addf(&out, "tag message %d", counter++); - return strbuf_detach(&out, len); + return strbuf_detach(&out, NULL); } static void handle_tail(struct object_array *commits, struct rev_info *revs, @@ -808,8 +807,8 @@ static void handle_tag(const char *name, struct tag *tag) name = anonymize_refname(name); if (message) { static struct hashmap tags; - message = anonymize_mem(&tags, anonymize_tag, - message, &message_size); + message = anonymize_str(&tags, anonymize_tag, + message, message_size); } } From patchwork Thu Jun 25 19:48:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626037 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 055F3912 for ; Thu, 25 Jun 2020 19:48:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E5EE3207D8 for ; Thu, 25 Jun 2020 19:48:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406874AbgFYTs1 (ORCPT ); Thu, 25 Jun 2020 15:48:27 -0400 Received: from cloud.peff.net ([104.130.231.41]:43302 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406635AbgFYTsZ (ORCPT ); Thu, 25 Jun 2020 15:48:25 -0400 Received: (qmail 31393 invoked by uid 109); 25 Jun 2020 19:48:24 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:24 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19626 invoked by uid 111); 25 Jun 2020 19:48:25 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:25 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:23 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 05/11] fast-export: stop storing lengths in anonymized hashmaps Message-ID: <20200625194823.GE4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Now that the anonymize_str() interface is restricted to NUL-terminated strings, there's no need for us to keep track of the length of each entry in the hashmap. This simplifies the code and saves a bit of memory. Note that we do still need to compare the stored results to partial strings passed in by the callers. We can do that by using hashmap's keydata feature to get the ptr/len pair into the comparison function, and then using strncmp(). Signed-off-by: Jeff King --- builtin/fast-export.c | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index d8ea067630..5df2ada47d 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -121,23 +121,32 @@ static int has_unshown_parent(struct commit *commit) struct anonymized_entry { struct hashmap_entry hash; const char *orig; - size_t orig_len; const char *anon; - size_t anon_len; +}; + +struct anonymized_entry_key { + struct hashmap_entry hash; + const char *orig; + size_t orig_len; }; static int anonymized_entry_cmp(const void *unused_cmp_data, const struct hashmap_entry *eptr, const struct hashmap_entry *entry_or_key, - const void *unused_keydata) + const void *keydata) { const struct anonymized_entry *a, *b; a = container_of(eptr, const struct anonymized_entry, hash); - b = container_of(entry_or_key, const struct anonymized_entry, hash); + if (keydata) { + const struct anonymized_entry_key *key = keydata; + int equal = !strncmp(a->orig, key->orig, key->orig_len) && + !a->orig[key->orig_len]; + return !equal; + } - return a->orig_len != b->orig_len || - memcmp(a->orig, b->orig, a->orig_len); + b = container_of(entry_or_key, const struct anonymized_entry, hash); + return strcmp(a->orig, b->orig); } /* @@ -149,23 +158,22 @@ static const char *anonymize_str(struct hashmap *map, char *(*generate)(const char *, size_t), const char *orig, size_t len) { - struct anonymized_entry key, *ret; + struct anonymized_entry_key key; + struct anonymized_entry *ret; if (!map->cmpfn) hashmap_init(map, anonymized_entry_cmp, NULL, 0); hashmap_entry_init(&key.hash, memhash(orig, len)); key.orig = orig; key.orig_len = len; - ret = hashmap_get_entry(map, &key, hash, NULL); + ret = hashmap_get_entry(map, &key, hash, &key); if (!ret) { ret = xmalloc(sizeof(*ret)); hashmap_entry_init(&ret->hash, key.hash.hash); ret->orig = xmemdupz(orig, len); - ret->orig_len = len; ret->anon = generate(orig, len); - ret->anon_len = strlen(ret->anon); hashmap_put(map, &ret->hash); } From patchwork Thu Jun 25 19:48:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626039 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A96AF912 for ; Thu, 25 Jun 2020 19:48:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8D1EE207D8 for ; Thu, 25 Jun 2020 19:48:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406877AbgFYTs1 (ORCPT ); Thu, 25 Jun 2020 15:48:27 -0400 Received: from cloud.peff.net ([104.130.231.41]:43308 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406866AbgFYTs1 (ORCPT ); Thu, 25 Jun 2020 15:48:27 -0400 Received: (qmail 31409 invoked by uid 109); 25 Jun 2020 19:48:27 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:27 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19646 invoked by uid 111); 25 Jun 2020 19:48:27 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:27 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:25 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 06/11] fast-export: use a flex array to store anonymized entries Message-ID: <20200625194825.GF4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Now that we're using a separate keydata struct for hash lookups, we have more flexibility in how we allocate anonymized_entry structs. Let's push the "orig" key into a flex member within the struct. That should save us a few bytes of memory per entry (a pointer plus any malloc overhead), and may make lookups a little faster (since it's one less pointer to chase in the comparison function). Signed-off-by: Jeff King --- builtin/fast-export.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 5df2ada47d..99d4a2b404 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -120,8 +120,8 @@ static int has_unshown_parent(struct commit *commit) struct anonymized_entry { struct hashmap_entry hash; - const char *orig; const char *anon; + const char orig[FLEX_ARRAY]; }; struct anonymized_entry_key { @@ -170,9 +170,8 @@ static const char *anonymize_str(struct hashmap *map, ret = hashmap_get_entry(map, &key, hash, &key); if (!ret) { - ret = xmalloc(sizeof(*ret)); + FLEX_ALLOC_MEM(ret, orig, orig, len); hashmap_entry_init(&ret->hash, key.hash.hash); - ret->orig = xmemdupz(orig, len); ret->anon = generate(orig, len); hashmap_put(map, &ret->hash); } From patchwork Thu Jun 25 19:48:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626043 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 24BDE912 for ; Thu, 25 Jun 2020 19:48:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0C62520781 for ; Thu, 25 Jun 2020 19:48:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406879AbgFYTsb (ORCPT ); Thu, 25 Jun 2020 15:48:31 -0400 Received: from cloud.peff.net ([104.130.231.41]:43318 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406844AbgFYTs3 (ORCPT ); Thu, 25 Jun 2020 15:48:29 -0400 Received: (qmail 31422 invoked by uid 109); 25 Jun 2020 19:48:29 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:29 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19653 invoked by uid 111); 25 Jun 2020 19:48:29 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:29 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:28 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 07/11] fast-export: move global "idents" anonymize hashmap into function Message-ID: <20200625194828.GG4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org All of the other anonymization functions keep their static mappings inside the function to avoid polluting the global namespace. Let's do the same for "idents", as nobody needs it outside of anonymize_ident_line(). Signed-off-by: Jeff King --- builtin/fast-export.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 99d4a2b404..16a1563e49 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -555,7 +555,6 @@ static char *anonymize_commit_message(const char *old) return xstrfmt("subject %d\n\nbody\n", counter++); } -static struct hashmap idents; static char *anonymize_ident(const char *old, size_t len) { static int counter; @@ -572,6 +571,7 @@ static char *anonymize_ident(const char *old, size_t len) */ static void anonymize_ident_line(const char **beg, const char **end) { + static struct hashmap idents; static struct strbuf buffers[] = { STRBUF_INIT, STRBUF_INIT }; static unsigned which_buffer; From patchwork Thu Jun 25 19:48:30 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626045 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C811B92A for ; Thu, 25 Jun 2020 19:48:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B83912076E for ; Thu, 25 Jun 2020 19:48:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406889AbgFYTsc (ORCPT ); Thu, 25 Jun 2020 15:48:32 -0400 Received: from cloud.peff.net ([104.130.231.41]:43342 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406844AbgFYTsb (ORCPT ); Thu, 25 Jun 2020 15:48:31 -0400 Received: (qmail 31448 invoked by uid 109); 25 Jun 2020 19:48:31 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:31 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19686 invoked by uid 111); 25 Jun 2020 19:48:31 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:31 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:30 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 08/11] fast-export: add a "data" callback parameter to anonymize_str() Message-ID: <20200625194830.GH4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org The anonymize_str() function takes a generator callback, but there's no way to pass extra context to it. Let's add the usual "void *data" parameter to the generator interface and pass it along. This is mildly annoying for existing callers, all of which pass NULL, but is necessary to avoid extra globals in some cases we'll add in a subsequent patch. While we're touching each of these callbacks, we can further observe that none of them use the existing orig/len parameters at all. This makes sense, since the point is for their output to have no discernable basis in the original (my original version had some notion that we might use a one-way function to obfuscate the names, but it was never implemented). So let's drop those extra parameters. If a caller really wants to do something with them, it can pass a struct through the new data parameter. Signed-off-by: Jeff King --- builtin/fast-export.c | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 16a1563e49..1cbca5b4b4 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -155,8 +155,9 @@ static int anonymized_entry_cmp(const void *unused_cmp_data, * is farmed out to the generate function. */ static const char *anonymize_str(struct hashmap *map, - char *(*generate)(const char *, size_t), - const char *orig, size_t len) + char *(*generate)(void *), + const char *orig, size_t len, + void *data) { struct anonymized_entry_key key; struct anonymized_entry *ret; @@ -172,7 +173,7 @@ static const char *anonymize_str(struct hashmap *map, if (!ret) { FLEX_ALLOC_MEM(ret, orig, orig, len); hashmap_entry_init(&ret->hash, key.hash.hash); - ret->anon = generate(orig, len); + ret->anon = generate(data); hashmap_put(map, &ret->hash); } @@ -187,12 +188,12 @@ static const char *anonymize_str(struct hashmap *map, */ static void anonymize_path(struct strbuf *out, const char *path, struct hashmap *map, - char *(*generate)(const char *, size_t)) + char *(*generate)(void *)) { while (*path) { const char *end_of_component = strchrnul(path, '/'); size_t len = end_of_component - path; - const char *c = anonymize_str(map, generate, path, len); + const char *c = anonymize_str(map, generate, path, len, NULL); strbuf_addstr(out, c); path = end_of_component; if (*path) @@ -367,7 +368,7 @@ static void print_path_1(const char *path) printf("%s", path); } -static char *anonymize_path_component(const char *path, size_t len) +static char *anonymize_path_component(void *data) { static int counter; struct strbuf out = STRBUF_INIT; @@ -389,7 +390,7 @@ static void print_path(const char *path) } } -static char *generate_fake_oid(const char *old, size_t len) +static char *generate_fake_oid(void *data) { static uint32_t counter = 1; /* avoid null oid */ const unsigned hashsz = the_hash_algo->rawsz; @@ -405,7 +406,7 @@ static const char *anonymize_oid(const char *oid_hex) { static struct hashmap objs; size_t len = strlen(oid_hex); - return anonymize_str(&objs, generate_fake_oid, oid_hex, len); + return anonymize_str(&objs, generate_fake_oid, oid_hex, len, NULL); } static void show_filemodify(struct diff_queue_struct *q, @@ -502,7 +503,7 @@ static const char *find_encoding(const char *begin, const char *end) return bol; } -static char *anonymize_ref_component(const char *old, size_t len) +static char *anonymize_ref_component(void *data) { static int counter; struct strbuf out = STRBUF_INIT; @@ -555,7 +556,7 @@ static char *anonymize_commit_message(const char *old) return xstrfmt("subject %d\n\nbody\n", counter++); } -static char *anonymize_ident(const char *old, size_t len) +static char *anonymize_ident(void *data) { static int counter; struct strbuf out = STRBUF_INIT; @@ -598,7 +599,7 @@ static void anonymize_ident_line(const char **beg, const char **end) len = split.mail_end - split.name_begin; ident = anonymize_str(&idents, anonymize_ident, - split.name_begin, len); + split.name_begin, len, NULL); strbuf_addstr(out, ident); strbuf_addch(out, ' '); strbuf_add(out, split.date_begin, split.tz_end - split.date_begin); @@ -739,7 +740,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev, show_progress(); } -static char *anonymize_tag(const char *old, size_t len) +static char *anonymize_tag(void *data) { static int counter; struct strbuf out = STRBUF_INIT; @@ -815,7 +816,7 @@ static void handle_tag(const char *name, struct tag *tag) if (message) { static struct hashmap tags; message = anonymize_str(&tags, anonymize_tag, - message, message_size); + message, message_size, NULL); } } From patchwork Thu Jun 25 19:48:32 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626047 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BBF2D92A for ; Thu, 25 Jun 2020 19:48:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A36DE2080C for ; Thu, 25 Jun 2020 19:48:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406895AbgFYTsf (ORCPT ); Thu, 25 Jun 2020 15:48:35 -0400 Received: from cloud.peff.net ([104.130.231.41]:43358 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406844AbgFYTse (ORCPT ); Thu, 25 Jun 2020 15:48:34 -0400 Received: (qmail 31460 invoked by uid 109); 25 Jun 2020 19:48:34 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:34 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19706 invoked by uid 111); 25 Jun 2020 19:48:34 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:34 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:32 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 09/11] fast-export: allow seeding the anonymized mapping Message-ID: <20200625194832.GI4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org After you anonymize a repository, it can be hard to find which commits correspond between the original and the result, and thus hard to reproduce commands that triggered bugs in the original. Let's make it possible to seed the anonymization map. This lets users either: - mark names to be retained as-is, if they don't consider them secret (in which case their original commands would just work) - map names to new values, which lets them adapt the reproduction recipe to the new names without revealing the originals The implementation is fairly straight-forward. We already store each anonymized token in a hashmap (so that the same token appearing twice is converted to the same result). We can just introduce a new "seed" hashmap which is consulted first. This does make a few more promises to the user about how we'll anonymize things (e.g., token-splitting pathnames). But it's unlikely that we'd want to change those rules, even if the actual anonymization of a single token changes. And it makes things much easier for the user, who can unblind only a directory name without having to specify each path within it. One alternative to this approach would be to anonymize as we see fit, and then dump the whole refname and pathname mappings to a file. This does work, but it's a bit awkward to use (you have to manually dig the items you care about out of the mapping). Helped-by: Eric Sunshine Signed-off-by: Jeff King --- Documentation/git-fast-export.txt | 29 ++++++++++++++++++ builtin/fast-export.c | 50 ++++++++++++++++++++++++++++++- t/t9351-fast-export-anonymize.sh | 11 ++++++- 3 files changed, 88 insertions(+), 2 deletions(-) diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt index e8950de3ba..1978dbdc6a 100644 --- a/Documentation/git-fast-export.txt +++ b/Documentation/git-fast-export.txt @@ -119,6 +119,11 @@ by keeping the marks the same across runs. the shape of the history and stored tree. See the section on `ANONYMIZING` below. +--anonymize-map=[:]:: + Convert token `` to `` in the anonymized output. If + `` is omitted, map `` to itself (i.e., do not + anonymize it). See the section on `ANONYMIZING` below. + --reference-excluded-parents:: By default, running a command such as `git fast-export master~5..master` will not include the commit master{tilde}5 @@ -238,6 +243,30 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much smaller output, and it is usually easy to quickly confirm that there is no private data in the stream. +Reproducing some bugs may require referencing particular commits or +paths, which becomes challenging after refnames and paths have been +anonymized. You can ask for a particular token to be left as-is or +mapped to a new value. For example, if you have a bug which reproduces +with `git rev-list sensitive -- secret.c`, you can run: + +--------------------------------------------------- +$ git fast-export --anonymize --all \ + --anonymize-map=sensitive:foo \ + --anonymize-map=secret.c:bar.c \ + >stream +--------------------------------------------------- + +After importing the stream, you can then run `git rev-list foo -- bar.c` +in the anonymized repository. + +Note that paths and refnames are split into tokens at slash boundaries. +The command above would anonymize `subdir/secret.c` as something like +`path123/bar.c`; you could then search for `bar.c` in the anonymized +repository to determine the final pathname. + +To make referencing the final pathname simpler, you can map each path +component; so if you also anonymize `subdir` to `publicdir`, then the +final pathname would be `publicdir/bar.c`. LIMITATIONS ----------- diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 1cbca5b4b4..b0b09bca30 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -45,6 +45,7 @@ static struct string_list extra_refs = STRING_LIST_INIT_NODUP; static struct string_list tag_refs = STRING_LIST_INIT_NODUP; static struct refspec refspecs = REFSPEC_INIT_FETCH; static int anonymize; +static struct hashmap anonymized_seeds; static struct revision_sources revision_sources; static int parse_opt_signed_tag_mode(const struct option *opt, @@ -168,8 +169,18 @@ static const char *anonymize_str(struct hashmap *map, hashmap_entry_init(&key.hash, memhash(orig, len)); key.orig = orig; key.orig_len = len; - ret = hashmap_get_entry(map, &key, hash, &key); + /* First check if it's a token the user configured manually... */ + if (anonymized_seeds.cmpfn) + ret = hashmap_get_entry(&anonymized_seeds, &key, hash, &key); + else + ret = NULL; + + /* ...otherwise check if we've already seen it in this context... */ + if (!ret) + ret = hashmap_get_entry(map, &key, hash, &key); + + /* ...and finally generate a new mapping if necessary */ if (!ret) { FLEX_ALLOC_MEM(ret, orig, orig, len); hashmap_entry_init(&ret->hash, key.hash.hash); @@ -1147,6 +1158,37 @@ static void handle_deletes(void) } } +static char *anonymize_seed(void *data) +{ + return xstrdup(data); +} + +static int parse_opt_anonymize_map(const struct option *opt, + const char *arg, int unset) +{ + struct hashmap *map = opt->value; + const char *delim, *value; + size_t keylen; + + BUG_ON_OPT_NEG(unset); + + delim = strchr(arg, ':'); + if (delim) { + keylen = delim - arg; + value = delim + 1; + } else { + keylen = strlen(arg); + value = arg; + } + + if (!keylen || !*value) + return error(_("--anonymize-map token cannot be empty")); + + anonymize_str(map, anonymize_seed, arg, keylen, (void *)value); + + return 0; +} + int cmd_fast_export(int argc, const char **argv, const char *prefix) { struct rev_info revs; @@ -1188,6 +1230,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"), N_("Apply refspec to exported refs")), OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")), + OPT_CALLBACK_F(0, "anonymize-map", &anonymized_seeds, N_("from:to"), + N_("convert to in anonymized output"), + PARSE_OPT_NONEG, parse_opt_anonymize_map), OPT_BOOL(0, "reference-excluded-parents", &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")), OPT_BOOL(0, "show-original-ids", &show_original_ids, @@ -1215,6 +1260,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix) if (argc > 1) usage_with_options (fast_export_usage, options); + if (anonymized_seeds.cmpfn && !anonymize) + die(_("--anonymize-map without --anonymize does not make sense")); + if (refspecs_list.nr) { int i; diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index dc5d75cd19..5a21c71568 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -6,6 +6,7 @@ test_description='basic tests for fast-export --anonymize' test_expect_success 'setup simple repo' ' test_commit base && test_commit foo && + test_commit retain-me && git checkout -b other HEAD^ && mkdir subdir && test_commit subdir/bar && @@ -18,7 +19,10 @@ test_expect_success 'setup simple repo' ' ' test_expect_success 'export anonymized stream' ' - git fast-export --anonymize --all >stream + git fast-export --anonymize --all \ + --anonymize-map=retain-me \ + --anonymize-map=xyzzy:custom-name \ + >stream ' # this also covers commit messages @@ -30,6 +34,11 @@ test_expect_success 'stream omits path names' ' ! grep xyzzy stream ' +test_expect_success 'stream contains user-specified names' ' + grep retain-me stream && + grep custom-name stream +' + test_expect_success 'stream omits gitlink oids' ' # avoid relying on the whole oid to remain hash-agnostic; this is # plenty to be unique within our test case From patchwork Thu Jun 25 19:48:35 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626051 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9052E92A for ; Thu, 25 Jun 2020 19:48:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8126C207D8 for ; Thu, 25 Jun 2020 19:48:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406926AbgFYTsn (ORCPT ); Thu, 25 Jun 2020 15:48:43 -0400 Received: from cloud.peff.net ([104.130.231.41]:43368 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406899AbgFYTsh (ORCPT ); Thu, 25 Jun 2020 15:48:37 -0400 Received: (qmail 31480 invoked by uid 109); 25 Jun 2020 19:48:36 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:36 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19739 invoked by uid 111); 25 Jun 2020 19:48:36 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:36 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:35 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 10/11] fast-export: anonymize "master" refname Message-ID: <20200625194835.GJ4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Running "fast-export --anonymize" will leave "refs/heads/master" untouched in the output, for two reasons: - it helped to have some known reference point between the original and anonymized repository - since it's historically the default branch name, it doesn't leak any information Now that we can ask fast-export to retain particular tokens, we have a much better tool for the first one (because it works for any ref, not just master). For the second, the notion of "default branch name" is likely to become configurable soon, at which point the name _does_ leak information. Let's drop this special case in preparation. Note that we have to adjust the test a bit, since it relied on using the name "master" in the anonymized repos. We could just use --anonymize-map=master to keep the same output, but then we wouldn't know if it works because of our hard-coded master or because of the explicit map. So let's flip the test a bit, and confirm that we anonymize "master", but keep "other" in the output. Signed-off-by: Jeff King --- builtin/fast-export.c | 7 ------- t/t9351-fast-export-anonymize.sh | 12 +++++++----- 2 files changed, 7 insertions(+), 12 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index b0b09bca30..c6ecf404d7 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -538,13 +538,6 @@ static const char *anonymize_refname(const char *refname) static struct strbuf anon = STRBUF_INIT; int i; - /* - * We also leave "master" as a special case, since it does not reveal - * anything interesting. - */ - if (!strcmp(refname, "refs/heads/master")) - return refname; - strbuf_reset(&anon); for (i = 0; i < ARRAY_SIZE(prefixes); i++) { if (skip_prefix(refname, prefixes[i], &refname)) { diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh index 5a21c71568..5ac2c3b5ee 100755 --- a/t/t9351-fast-export-anonymize.sh +++ b/t/t9351-fast-export-anonymize.sh @@ -22,6 +22,7 @@ test_expect_success 'export anonymized stream' ' git fast-export --anonymize --all \ --anonymize-map=retain-me \ --anonymize-map=xyzzy:custom-name \ + --anonymize-map=other \ >stream ' @@ -45,12 +46,12 @@ test_expect_success 'stream omits gitlink oids' ' ! grep a000000000000000000 stream ' -test_expect_success 'stream allows master as refname' ' - grep master stream +test_expect_success 'stream retains other as refname' ' + grep other stream ' test_expect_success 'stream omits other refnames' ' - ! grep other stream && + ! grep master stream && ! grep mytag stream ' @@ -76,15 +77,16 @@ test_expect_success 'import stream to new repository' ' test_expect_success 'result has two branches' ' git for-each-ref --format="%(refname)" refs/heads >branches && test_line_count = 2 branches && - other_branch=$(grep -v refs/heads/master branches) + other_branch=refs/heads/other && + main_branch=$(grep -v $other_branch branches) ' test_expect_success 'repo has original shape and timestamps' ' shape () { git log --format="%m %ct" --left-right --boundary "$@" } && (cd .. && shape master...other) >expect && - shape master...$other_branch >actual && + shape $main_branch...$other_branch >actual && test_cmp expect actual ' From patchwork Thu Jun 25 19:48:37 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 11626049 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C638E912 for ; Thu, 25 Jun 2020 19:48:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A869C2080C for ; Thu, 25 Jun 2020 19:48:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2406844AbgFYTsm (ORCPT ); Thu, 25 Jun 2020 15:48:42 -0400 Received: from cloud.peff.net ([104.130.231.41]:43384 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406926AbgFYTsj (ORCPT ); Thu, 25 Jun 2020 15:48:39 -0400 Received: (qmail 31499 invoked by uid 109); 25 Jun 2020 19:48:38 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 25 Jun 2020 19:48:38 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 19746 invoked by uid 111); 25 Jun 2020 19:48:39 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 25 Jun 2020 15:48:39 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 25 Jun 2020 15:48:37 -0400 From: Jeff King To: git@vger.kernel.org Cc: Eric Sunshine , Junio C Hamano , Johannes Schindelin , SZEDER =?utf-8?b?R8OhYm9y?= Subject: [PATCH v2 11/11] fast-export: use local array to store anonymized oid Message-ID: <20200625194837.GK4029374@coredump.intra.peff.net> References: <20200625194802.GA4028913@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20200625194802.GA4028913@coredump.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Some older versions of gcc complain about this line: builtin/fast-export.c:412:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing] put_be32(oid.hash + hashsz - 4, counter++); ^ This seems to be a false positive, as there's no type-punning at all here. oid.hash is an array of unsigned char; when we pass it to a function it decays to a pointer to unsigned char. We do take a void pointer in put_be32(), but it's immediately aliased with another pointer to unsigned char (and clearly the compiler is looking inside the inlined put_be32(), since the warning doesn't happen with -O0). This happens on gcc 4.8 and 4.9, but not later versions (I tested gcc 6, 7, 8, and 9). We can work around it by using a local array instead of an object_id struct. This is a little more intimate with the details of object_id, but for whatever reason doesn't seem to trigger the compiler warning. We can revert this patch once we decide that those gcc versions are too old to care about for a warning like this (gcc 4.8 is the default compiler for Ubuntu Trusty, which is out-of-support but not fully end-of-life'd until April 2022). Signed-off-by: Jeff King --- builtin/fast-export.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index c6ecf404d7..9f37895d4c 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -405,12 +405,12 @@ static char *generate_fake_oid(void *data) { static uint32_t counter = 1; /* avoid null oid */ const unsigned hashsz = the_hash_algo->rawsz; - struct object_id oid; + unsigned char out[GIT_MAX_RAWSZ]; char *hex = xmallocz(GIT_MAX_HEXSZ); - oidclr(&oid); - put_be32(oid.hash + hashsz - 4, counter++); - return oid_to_hex_r(hex, &oid); + hashclr(out); + put_be32(out + hashsz - 4, counter++); + return hash_to_hex_algop_r(hex, out, the_hash_algo); } static const char *anonymize_oid(const char *oid_hex)