From patchwork Tue Jun 23 17:47:00 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621265 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 219BC60D for ; Tue, 23 Jun 2020 17:47:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 01DD3206D4 for ; Tue, 23 Jun 2020 17:47:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="KT98GLqL" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733092AbgFWRrP (ORCPT ); Tue, 23 Jun 2020 13:47:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56846 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732549AbgFWRrO (ORCPT ); Tue, 23 Jun 2020 13:47:14 -0400 Received: from mail-wm1-x341.google.com (mail-wm1-x341.google.com [IPv6:2a00:1450:4864:20::341]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4C8D4C061755 for ; Tue, 23 Jun 2020 10:47:14 -0700 (PDT) Received: by mail-wm1-x341.google.com with SMTP id f139so2742558wmf.5 for ; Tue, 23 Jun 2020 10:47:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=ppLWTMM/QtjlkJ6BAXu8a4TJ2DJ2gu6KkEEaeIgaoCc=; b=KT98GLqLfpDu/1PLBRXoXQAyUcwnzeNdZxeATapa+ltfGbVxXRLbTr7h9TRyyjLEUc 8dK4ZSl8uLApLyN9OeZwCzxAloHwOw960FwHal+8p+ZNzEsggCVJoRYtmqy8ZURTIYVB GTcXuXNE/8SwPljL05T0/nBZSAFh5jxbXoqaZ4qorrmuWaHyUfidWN7diPXpQNMbk743 aGoL/oAzNytE5zi8wX5lVxvX9ApwJH+gt6v5+GoPHZNGRzxRHkVXFLua9fzBm7PyhH1Y 1QznuzzajggXhag0iBRCJ6JlDQJoBIgb2GFZSDvCO+WBeVmqBfufBsNXwLQ/nplGTMWm SnUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=ppLWTMM/QtjlkJ6BAXu8a4TJ2DJ2gu6KkEEaeIgaoCc=; b=Gi1ummWsEt2L17dVZ6lPMreWopjnBpCop1jWHUqyOjANfp5CMt0DiTIINwQbj3xz2u bU3SnPXUcGjDraIi3a1WZUTDSYmSWUXNy4wPQ2kxeezQUqiAy8QbOjCoomJCzoU4u9am GDf9UVB15XDcRRaapit3XaTQJx9NIjWlJbZGMfqTHFT2IsEQHgffgN5F8QrbqsI7t917 /vbsucC1wR68haaTyPrCcNqjLEgnkSPonr/aVTX3gmxPfG+Qu9IU3hedjJmbfT/7dhpu IADGfg90FZFuGMoLAKIAEiUiQtiZR7N/ALsqeAnl67YeR0aHp2OYg4m4/8CAjHfc2fPq eoEw== X-Gm-Message-State: AOAM533XxL9tmxMYeHkHqehHwWxAETsIFk+gE6wJQM/mYbS50A7tf/1H 33Mx0EUPoeP/YP8lKwqvEoSv8+Yl X-Google-Smtp-Source: ABdhPJyaG4mHYZTWF71D5yYcuup2xu9S+o2G5FDnpUsPrT+PIok90FH2YD2HL3sDWmpvCxSiyEmzjA== X-Received: by 2002:a05:600c:221a:: with SMTP id z26mr18661697wml.177.1592934432882; Tue, 23 Jun 2020 10:47:12 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id v20sm4404664wmh.26.2020.06.23.10.47.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:12 -0700 (PDT) Message-Id: <57002040bc83668d4998da6a1e34a6efe287ae22.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:00 +0000 Subject: [PATCH v2 01/11] commit-graph: place bloom_settings in context MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , Derrick Stolee Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee Place an instance of struct bloom_settings into the struct write_commit_graph_context. This allows simplifying the function prototype of write_graph_chunk_bloom_data(). This will allow us to combine the function prototypes and use function pointers to simplify write_commit_graph_file(). By using a pointer, we can later replace the settings to match those that exist in the current commit-graph, in case a future Git version allows customization of these parameters. Reported-by: SZEDER Gábor Signed-off-by: Derrick Stolee --- commit-graph.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 887837e882..d0fedcd9b1 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -882,6 +882,7 @@ struct write_commit_graph_context { const struct split_commit_graph_opts *split_opts; size_t total_bloom_filter_data_size; + const struct bloom_filter_settings *bloom_settings; }; static void write_graph_chunk_fanout(struct hashfile *f, @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, } static void write_graph_chunk_bloom_data(struct hashfile *f, - struct write_commit_graph_context *ctx, - const struct bloom_filter_settings *settings) + struct write_commit_graph_context *ctx) { struct commit **list = ctx->commits.list; struct commit **last = ctx->commits.list + ctx->commits.nr; @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, _("Writing changed paths Bloom filters data"), ctx->commits.nr); - hashwrite_be32(f, settings->hash_version); - hashwrite_be32(f, settings->num_hashes); - hashwrite_be32(f, settings->bits_per_entry); + hashwrite_be32(f, ctx->bloom_settings->hash_version); + hashwrite_be32(f, ctx->bloom_settings->num_hashes); + hashwrite_be32(f, ctx->bloom_settings->bits_per_entry); while (list < last) { struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) struct object_id file_hash; const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS; + ctx->bloom_settings = &bloom_settings; + if (ctx->split) { struct strbuf tmp_file = STRBUF_INIT; @@ -1642,7 +1644,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) write_graph_chunk_extra_edges(f, ctx); if (ctx->changed_paths) { write_graph_chunk_bloom_indexes(f, ctx); - write_graph_chunk_bloom_data(f, ctx, &bloom_settings); + write_graph_chunk_bloom_data(f, ctx); } if (ctx->num_commit_graphs_after > 1 && write_graph_chunk_base(f, ctx)) { From patchwork Tue Jun 23 17:47:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621269 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 08AB514F6 for ; Tue, 23 Jun 2020 17:47:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E64B520780 for ; Tue, 23 Jun 2020 17:47:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="cHoF+Y1K" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733234AbgFWRrT (ORCPT ); Tue, 23 Jun 2020 13:47:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56852 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732549AbgFWRrP (ORCPT ); Tue, 23 Jun 2020 13:47:15 -0400 Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com [IPv6:2a00:1450:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 27BC9C061573 for ; Tue, 23 Jun 2020 10:47:15 -0700 (PDT) Received: by mail-wr1-x441.google.com with SMTP id z13so9697274wrw.5 for ; Tue, 23 Jun 2020 10:47:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=UiCtqTJGVEULmS7sPq1CCgWHFh2xZ9/Kbd4Wgi4nLYI=; b=cHoF+Y1KtYEzyWlkH2HA32I0VkomZd6vrelTAfxvTmhoWz5+nf7Ig/TcQOMeUfo0nS 8r8f3+E13hQfcCFkoArog9OlCcQuWkHSgPYVzcYdUZbKtokBwHIqtByH9+qqEMOUQu3w 0rq4gvOkGedm2Rbg2iDAT7YkS5nkYlqB9E2uBcnMEKDSJyWWwO+6ZKZoEamtyjFsHnfP e3GZT0bO4GKgiGXHTEdmUGmoNq1Yq+kvkyZjWRap/kvYLoNZCr/F1Iv2GY9yuQ3ZFoWY 8nxbbnnbrqsqd9mWaQmS80qmI5iiXhZYv7erIb7CnpPk/dI+PJlC/6ue0jSoWAAhEBXf eldg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=UiCtqTJGVEULmS7sPq1CCgWHFh2xZ9/Kbd4Wgi4nLYI=; b=pqUkeiHphugFH+XURMPsM7ISoRb2FW6jRZEIq6pUvmWqMkalaua1kKaSAQ9cx5iWIB M3mcWrFale9H/qllku3ttoAzsOeUhTeonxoqnDPeeHo7KfZRU7gXdHJsmLYM4K1GsKrs AAoAoXuxQmAMz86evMwAGlf4xzp6D2OMXnY7iyv9EoD2tdPX/UhOcWi1vE/l6sEABvyj yxEAzOtYzuelFzRG4W7Aewzf+61i9y+jmhLGxcd8ImEgN8bSJD0F+g5rBMbI8IP6FfOf KXPfDgaiUltl2jgXXUU8G0vVaceG9Uqova1Wf8GLPG0ZvYRviXzL9/xDBDTEypwujnrF hcqg== X-Gm-Message-State: AOAM5336kITvpPjzSyQbByiPDbcOf9b9wPhi51/ca4gxiFKesQkdVmkd JD+cMTJuSIMhkrFd5XOug9dpl2Z1 X-Google-Smtp-Source: ABdhPJwyR5ddMAyG9eEPAjR72Xh5LyvEcfEYsOmYq51O5a/fZNdtlYLu0PC6NqpwCKrUwPPcvpZ2hA== X-Received: by 2002:adf:ecc8:: with SMTP id s8mr26819030wro.317.1592934433670; Tue, 23 Jun 2020 10:47:13 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id c5sm4665437wmb.24.2020.06.23.10.47.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:13 -0700 (PDT) Message-Id: <6b63f9bd8a2a7e18d7ac1be7066d4bcd1df2a729.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:01 +0000 Subject: [PATCH v2 02/11] commit-graph: change test to die on parse, not load Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , Derrick Stolee Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee 43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment variable. This was created to verify that commit-graph was not loaded when writing a new non-incremental commit-graph. An upcoming change wants to load a commit-graph in some valuable cases, but we want to maintain that we don't trust the commit-graph data when writing our new file. Instead of dying on load, instead die if we ever try to parse a commit from the commit-graph. This functionally verifies the same intended behavior, but allows a more advanced feature in the next change. Signed-off-by: Derrick Stolee --- commit-graph.c | 12 ++++++++---- commit-graph.h | 2 +- t/t5318-commit-graph.sh | 2 +- 3 files changed, 10 insertions(+), 6 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index d0fedcd9b1..6a28d4a5a6 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -564,10 +564,6 @@ static int prepare_commit_graph(struct repository *r) return !!r->objects->commit_graph; r->objects->commit_graph_attempted = 1; - if (git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD, 0)) - die("dying as requested by the '%s' variable on commit-graph load!", - GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD); - prepare_repo_settings(r); if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) && @@ -790,6 +786,14 @@ static int parse_commit_in_graph_one(struct repository *r, int parse_commit_in_graph(struct repository *r, struct commit *item) { + static int checked_env = 0; + + if (!checked_env && + git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0)) + die("dying as requested by the '%s' variable on commit-graph parse!", + GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE); + checked_env = 1; + if (!prepare_commit_graph(r)) return 0; return parse_commit_in_graph_one(r, r->objects->commit_graph, item); diff --git a/commit-graph.h b/commit-graph.h index 881c9b46e5..f0fb13e3f2 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -5,7 +5,7 @@ #include "object-store.h" #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH" -#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" +#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE" #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS" /* diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 1073f9e3cf..5ec01abdaa 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -436,7 +436,7 @@ corrupt_graph_verify() { cp $objdir/info/commit-graph commit-graph-pre-write-test fi && git status --short && - GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD=true git commit-graph write && + GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE=true git commit-graph write && git commit-graph verify } From patchwork Tue Jun 23 17:47:02 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621267 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 194C792A for ; Tue, 23 Jun 2020 17:47:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 00C8B20780 for ; Tue, 23 Jun 2020 17:47:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="GRizELV1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733203AbgFWRrS (ORCPT ); Tue, 23 Jun 2020 13:47:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56856 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733119AbgFWRrQ (ORCPT ); Tue, 23 Jun 2020 13:47:16 -0400 Received: from mail-wm1-x343.google.com (mail-wm1-x343.google.com [IPv6:2a00:1450:4864:20::343]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D8C2AC061755 for ; Tue, 23 Jun 2020 10:47:15 -0700 (PDT) Received: by mail-wm1-x343.google.com with SMTP id o8so2012432wmh.4 for ; Tue, 23 Jun 2020 10:47:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=sSgnGIxsYvHWBit9NpSFFpRWHko+DGmjncngfvE17G0=; b=GRizELV1zXGtgXDArCzh9vhxpWH6BdElvHXzUpuYw3RsRtmFtwlKjnPuR26c1sRUk/ inZVx5XCBxu9GdNYMZXF16WqbhVrI8RTYbNG+RY77F3CZVT2bGWVO516rgsBIXAMAghO wbuVwWlyXG8PR3YGBoS5UtO8J0ByG01/CtrtI9n2a/PC83m0jRoqIGSWrUmMbXpxoSnZ /9A6LMfytiwZXOesAGouUdy4vmqiIQ7TejTLCE+fU2MvBgIRuThhb9DLHCCj9uJiBNwq HNikp03zYrdBtpl3m4vIlchOUpYPKEKupHeOqBcoMgtiv64+zBIksxeiPN+CXW+Eng/I oOog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=sSgnGIxsYvHWBit9NpSFFpRWHko+DGmjncngfvE17G0=; b=rwhpWf9b2sA/UeBdnBAwkB3jDf3elvV0zjGmbSVxNere3IoQG3QbYkl6H+oWmBj2wk zy4ZhUa1xduXlDzMwU49mYLISTm9dO+F9OIwiqBCg86MhtsX5icXYOI3uGF09FQIgYSm UZ7OtVdGc7CBWufkvRZrpkDaN8h0PN8yRf++g7EqhR4YMQMqudT9J75qWGkBPHLX+dGQ f+N8GbM2o6bOfUE7hhH3ux70TQxdyCpj4Vz0Gkz6wsqPLz4HJzH+7TS4elVEnxcP0KO0 nUKc2WANta8d1kSTqcgtLKmD21uvV9Pa7FEPwfu9otXYFYcg26tWpQMXyrs4aGGWpB53 DGGw== X-Gm-Message-State: AOAM531GUo0UlgF6FKGG/gdWp480jB2oXvGvfPnhsQlX5tFQSwKDR41L zDzLwDdGECbjKj2Lp34UEq5wttZl X-Google-Smtp-Source: ABdhPJwXEiE+LlAFNCuThPttOyp7L97plwUNVPvn7JPkoqu7ZZy7adhtsYMN7QMhSHZFAKrhg/9Dbw== X-Received: by 2002:a7b:cbc5:: with SMTP id n5mr11874663wmi.95.1592934434490; Tue, 23 Jun 2020 10:47:14 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id o29sm15941117wra.5.2020.06.23.10.47.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:14 -0700 (PDT) Message-Id: <492deaf916464abedc7fc2d3de41fe690a558d9b.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:02 +0000 Subject: [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , Derrick Stolee Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee The get_bloom_filter() method is a bit complicated in some parts where it does not need to be. In particular, it needs to return a NULL filter only when compute_if_not_present is zero AND the filter data cannot be loaded from a commit-graph file. This currently happens by accident because the commit-graph does not load changed-path Bloom filters from an existing commit-graph when writing a new one. This will change in a later patch. Also clean up some style issues while we are here. Signed-off-by: Derrick Stolee --- bloom.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/bloom.c b/bloom.c index c38d1cff0c..7291506564 100644 --- a/bloom.c +++ b/bloom.c @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct diff_options diffopt; int max_changes = 512; - if (bloom_filters.slab_size == 0) + if (!bloom_filters.slab_size) return NULL; filter = bloom_filter_slab_at(&bloom_filters, c); @@ -194,16 +194,15 @@ struct bloom_filter *get_bloom_filter(struct repository *r, if (!filter->data) { load_commit_graph_info(r, c); if (c->graph_pos != COMMIT_NOT_FROM_GRAPH && - r->objects->commit_graph->chunk_bloom_indexes) { - if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c)) - return filter; - else - return NULL; - } + r->objects->commit_graph->chunk_bloom_indexes && + load_bloom_filter_from_graph(r->objects->commit_graph, filter, c)) + return filter; } - if (filter->data || !compute_if_not_present) + if (filter->data) return filter; + if (!filter->data && !compute_if_not_present) + return NULL; repo_diff_setup(r, &diffopt); diffopt.flags.recursive = 1; From patchwork Tue Jun 23 17:47:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621271 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4A5AE14F6 for ; Tue, 23 Jun 2020 17:47:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2C42E20774 for ; Tue, 23 Jun 2020 17:47:22 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="UyGuk9+k" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733250AbgFWRrU (ORCPT ); Tue, 23 Jun 2020 13:47:20 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56860 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733156AbgFWRrR (ORCPT ); Tue, 23 Jun 2020 13:47:17 -0400 Received: from mail-wm1-x343.google.com (mail-wm1-x343.google.com [IPv6:2a00:1450:4864:20::343]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CCD1FC061795 for ; Tue, 23 Jun 2020 10:47:16 -0700 (PDT) Received: by mail-wm1-x343.google.com with SMTP id f139so2742677wmf.5 for ; Tue, 23 Jun 2020 10:47:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=sYeRzy3NN5UMY3/v2uT3PQdy+N59M4OW78Nyx9IeK/U=; b=UyGuk9+ki/Zz9Jd2psbJD7Si8WTjt7qujXhmJz9lq2mCLeTgYfFSL3J+bvSiHlno5y WVGWthCl3CZn4Ytj4BlJeKdV0ZD1LoMRLAxZyDnfCZK2dyCO8dQSYGDHjanhJqcCc4XB Sr3IJYCFJoi6gJdZzqjbQ3DbdNCkNtaLQ9aRAUiM/UMqpfSEx+QfqZeaGZFqc25h/v9F CKsZN4nRczGgGAX6iCAugjPFDBFpWfldlcUJn7fe+EzXFN6VLUhrOPMdkGXfhE+LXH4Q qS+AzINtA8o/Jejvf2BG2m2wZXYio6ZV1nhrxvkpLTzWmi24ITkt2dnybXS20yTfE816 wIUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=sYeRzy3NN5UMY3/v2uT3PQdy+N59M4OW78Nyx9IeK/U=; b=nK6du51JNPA/sQv1UDHvzCeWbsaAbS4WD5tVHVsIx+AOuT758OlbyPXNVwIZzuKu4B pQiD6CFqMijLkUmDQdMkU4PGhSjQRBk0zt0g8NkzUSt0BBbXskGAcMWXdUO7sfKd2a3O Nd/ABunmxA+gHRztnauGPkRj0YOq5lohq8BWy504ImeFP6GOfYzOl7SkXDPXNP3If0fj oAfU0i/O0lO66+PqeJ+cWeTSaloqOX5emoSqQzB1Luw2if7/w9OJFGCQhnaVu+AoaHSr grhqM8dKGWk/t2A5R7q5IYYBu60txMMzeW4mfrfCePFb5EoRo4GUCqF2zCcwtvkJef3C 2+fA== X-Gm-Message-State: AOAM532t73m8NBhtPeXlrVWssLc4kVFWYJbe0VERZP6dOxAkZq5ZMCGm Sx6OU7+RP1ciJwA22UppYl8UyD73 X-Google-Smtp-Source: ABdhPJwGllYqp0KxkhO81M4CXn38Nb5a0A3DS9or+nVIGTlY0+QIBUxcVgsIC3LGIQ885O3CBvCMZQ== X-Received: by 2002:a1c:2049:: with SMTP id g70mr21407032wmg.90.1592934435294; Tue, 23 Jun 2020 10:47:15 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id r1sm15467594wrt.73.2020.06.23.10.47.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:14 -0700 (PDT) Message-Id: <8727b254680fdbc4f42f8b46c6ec686e7fa33fbe.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:03 +0000 Subject: [PATCH v2 04/11] commit-graph: persist existence of changed-paths Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , Derrick Stolee Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee The changed-path Bloom filters were released in v2.27.0, but have a significant drawback. A user can opt-in to writing the changed-path filters using the "--changed-paths" option to "git commit-graph write" but the next write will drop the filters unless that option is specified. This becomes even more important when considering the interaction with gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of features.experimental). These config options trigger commit-graph writes that the user did not signal, and hence there is no --changed-paths option available. Allow a user that opts-in to the changed-path filters to persist the property of "my commit-graph has changed-path filters" automatically. A user can drop filters using the --no-changed-paths option. In the process, we need to be extremely careful to match the Bloom filter settings as specified by the commit-graph. This will allow future versions of Git to customize these settings, and the version with this change will persist those settings as commit-graphs are rewritten on top. Use the trace2 API to signal the settings used during the write, and check that output in a test after manually adjusting the correct bytes in the commit-graph file. Signed-off-by: Derrick Stolee --- Documentation/git-commit-graph.txt | 5 +++- builtin/commit-graph.c | 5 +++- commit-graph.c | 38 ++++++++++++++++++++++++++++-- commit-graph.h | 1 + t/t4216-log-bloom.sh | 29 ++++++++++++++++++++++- 5 files changed, 73 insertions(+), 5 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index f4b13c005b..369b222b08 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -60,7 +60,10 @@ existing commit-graph file. With the `--changed-paths` option, compute and write information about the paths changed between a commit and it's first parent. This operation can take a while on large repositories. It provides significant performance gains -for getting history of a directory or a file with `git log -- `. +for getting history of a directory or a file with `git log -- `. If +this option is given, future commit-graph writes will automatically assume +that this option was intended. Use `--no-changed-paths` to stop storing this +data. + With the `--split` option, write the commit-graph as a chain of multiple commit-graph files stored in `/info/commit-graphs`. The new commits diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 59009837dc..ff7b177c33 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -151,6 +151,7 @@ static int graph_write(int argc, const char **argv) }; opts.progress = isatty(2); + opts.enable_changed_paths = -1; split_opts.size_multiple = 2; split_opts.max_commits = 0; split_opts.expire_time = 0; @@ -171,7 +172,9 @@ static int graph_write(int argc, const char **argv) flags |= COMMIT_GRAPH_WRITE_SPLIT; if (opts.progress) flags |= COMMIT_GRAPH_WRITE_PROGRESS; - if (opts.enable_changed_paths || + if (!opts.enable_changed_paths) + flags |= COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS; + if (opts.enable_changed_paths == 1 || git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0)) flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS; diff --git a/commit-graph.c b/commit-graph.c index 6a28d4a5a6..908f094271 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -16,6 +16,8 @@ #include "progress.h" #include "bloom.h" #include "commit-slab.h" +#include "json-writer.h" +#include "trace2.h" void git_test_write_commit_graph_or_die(void) { @@ -1107,6 +1109,21 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, stop_progress(&progress); } +static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx) +{ + struct json_writer jw = JSON_WRITER_INIT; + + jw_object_begin(&jw, 0); + jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version); + jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes); + jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry); + jw_end(&jw); + + trace2_data_json("bloom", ctx->r, "settings", &jw); + + jw_release(&jw); +} + static void write_graph_chunk_bloom_data(struct hashfile *f, struct write_commit_graph_context *ctx) { @@ -1115,6 +1132,8 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, struct progress *progress = NULL; int i = 0; + trace2_bloom_filter_settings(ctx); + if (ctx->report_progress) progress = start_delayed_progress( _("Writing changed paths Bloom filters data"), @@ -1545,7 +1564,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) struct object_id file_hash; const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS; - ctx->bloom_settings = &bloom_settings; + if (!ctx->bloom_settings) + ctx->bloom_settings = &bloom_settings; if (ctx->split) { struct strbuf tmp_file = STRBUF_INIT; @@ -1970,9 +1990,23 @@ int write_commit_graph(struct object_directory *odb, ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0; ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0; ctx->split_opts = split_opts; - ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0; ctx->total_bloom_filter_data_size = 0; + if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS) + ctx->changed_paths = 1; + if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) { + struct commit_graph *g; + prepare_commit_graph_one(ctx->r, ctx->odb); + + g = ctx->r->objects->commit_graph; + + /* We have changed-paths already. Keep them in the next graph */ + if (g && g->chunk_bloom_data) { + ctx->changed_paths = 1; + ctx->bloom_settings = g->bloom_filter_settings; + } + } + if (ctx->split) { struct commit_graph *g; prepare_commit_graph(ctx->r); diff --git a/commit-graph.h b/commit-graph.h index f0fb13e3f2..45b1e5bca3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -96,6 +96,7 @@ enum commit_graph_write_flags { /* Make sure that each OID in the input is a valid commit OID. */ COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3), COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4), + COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS = (1 << 5), }; struct split_commit_graph_opts { diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index c7011f33e2..426de10041 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -126,7 +126,7 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters test_commit c14 A/anotherFile2 && test_commit c15 A/B/anotherFile2 && test_commit c16 A/B/C/anotherFile2 && - GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split && + git commit-graph write --reachable --split --no-changed-paths && test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain ' @@ -152,4 +152,31 @@ test_expect_success 'Use Bloom filters if they exist in the latest but not all c test_bloom_filters_used_when_some_filters_are_missing "-- A/B" ' +BASE_BDAT_OFFSET=2240 +BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10)) +BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14)) + +corrupt_graph() { + pos=$1 + data="${2:-\0}" + grepstr=$3 + orig_size=$(wc -c < .git/objects/info/commit-graph) && + zero_pos=${4:-${orig_size}} && + printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc && + dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null +} + +test_expect_success 'persist filter settings' ' + test_when_finished rm -rf .git/objects/info/commit-graph* && + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths && + grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt && + cp .git/objects/info/commit-graph commit-graph-before && + corrupt_graph $BASE_K_BYTE_OFFSET "\09" && + corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" && + cp .git/objects/info/commit-graph commit-graph-after && + test_commit c18 A/corrupt && + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths && + grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt +' + test_done \ No newline at end of file From patchwork Tue Jun 23 17:47:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621273 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 16FA492A for ; Tue, 23 Jun 2020 17:47:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E9B9D20780 for ; Tue, 23 Jun 2020 17:47:22 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Ag5CLgkC" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733258AbgFWRrV (ORCPT ); Tue, 23 Jun 2020 13:47:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56864 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733167AbgFWRrS (ORCPT ); Tue, 23 Jun 2020 13:47:18 -0400 Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com [IPv6:2a00:1450:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A7FA1C061573 for ; Tue, 23 Jun 2020 10:47:17 -0700 (PDT) Received: by mail-wr1-x444.google.com with SMTP id q5so9059803wru.6 for ; Tue, 23 Jun 2020 10:47:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=As+XaR3i/7WhEvBBuzPQae7jR5XZACb2XpAm/niysU0=; b=Ag5CLgkCnOOWQh7fFdXGNp8vZvSJ47fCWdausMOo/csgv9+s822IDCL/9w4YEQI2VY Tmlsl8ebHnY4HyMdeAWSH3e5yMeo/zGxiL4Vl0v8dCeCUq5YEkt1gasBkH6oFY5Cm9Nd 70wqUhkUnCjYTgya1iUT/nOkDydZyNuAT/ZPssH6THKMN5xUznZrtEGYbfC5vkYlfWPq TgX9id+No9n5pn2DaiaQ4oR6UlAZOOWtxreMnsLO8BVf/UTIFaqskYa2JOn9dbeuED/G 5Z71IUac/rDMOl4uTuqwoIpQe5YaPWyCjGtJQ2yf3DGszaSEtdZP97XYovte/1AkucEi xEAw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=As+XaR3i/7WhEvBBuzPQae7jR5XZACb2XpAm/niysU0=; b=tVH/xwt0tkHbRJERuZ8mY0cWu9tPJdVeLB69WxrMMTV4bgpnGLLr0KCBeX7X3eO+EE EPYrwdv2WUNxwteLmyV3KsSsY3Imjks+R8kapjvjxXBM1CpgM8O2x8FAMEPi978FPlX6 K/id/0xur7OX8dyRzO1EBNlKU4OOtnktCC0UNsupsTHTmBcFJyqlD94k9yY/83Z/B7RV 8saqI2KYGorbejpdkM0RhNj1RLnv/DxPS341Ai9sEF1FcUQDtNOcOw1aLgn2E83Ken9n aVaXlDtyI4D8KC0iid5FgUBimPl0aN4AOK7rI9YjfxJrleYF5Z5pdnyOyvJ3QEUoJHzO vmcQ== X-Gm-Message-State: AOAM530pq1kJBF4eq0/cRdmsjyGrWZk6ZBc41CSTVMGEpxaRQLCvjEBx hFeU0uZtMyJfh93bxspQPEr6z3b1 X-Google-Smtp-Source: ABdhPJzwFpAq9bvj6W0gjr1VGu0ZEIImqH085tAJN+6tZxIjRpcTIB39Zb1LFNl6Z5ghnjkoDi8CXA== X-Received: by 2002:adf:e906:: with SMTP id f6mr18490096wrm.425.1592934436219; Tue, 23 Jun 2020 10:47:16 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id u15sm17467036wrm.64.2020.06.23.10.47.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:15 -0700 (PDT) Message-Id: <244668fec4428872f441d515e570e923808f0db1.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: " =?utf-8?q?SZEDER_G=C3=A1bor?= via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:04 +0000 Subject: [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , =?utf-8?q?SZEDER_G=C3=A1bor?= Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= Update the write_graph_chunk_*() helper functions to have the same signature: - Return an int error code from all these functions. write_graph_chunk_base() already has an int error code, now the others will have one, too, but since they don't indicate any error, they will always return 0. - Drop the hash size parameter of write_graph_chunk_oids() and write_graph_chunk_data(); its value can be read directly from 'the_hash_algo' inside these functions as well. This opens up the possibility for further cleanups and foolproofing in the following two patches. Signed-off-by: SZEDER Gábor Signed-off-by: Derrick Stolee --- commit-graph.c | 42 ++++++++++++++++++++++++++---------------- 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 908f094271..f33bfe49b3 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -891,8 +891,8 @@ struct write_commit_graph_context { const struct bloom_filter_settings *bloom_settings; }; -static void write_graph_chunk_fanout(struct hashfile *f, - struct write_commit_graph_context *ctx) +static int write_graph_chunk_fanout(struct hashfile *f, + struct write_commit_graph_context *ctx) { int i, count = 0; struct commit **list = ctx->commits.list; @@ -913,17 +913,21 @@ static void write_graph_chunk_fanout(struct hashfile *f, hashwrite_be32(f, count); } + + return 0; } -static void write_graph_chunk_oids(struct hashfile *f, int hash_len, - struct write_commit_graph_context *ctx) +static int write_graph_chunk_oids(struct hashfile *f, + struct write_commit_graph_context *ctx) { struct commit **list = ctx->commits.list; int count; for (count = 0; count < ctx->commits.nr; count++, list++) { display_progress(ctx->progress, ++ctx->progress_cnt); - hashwrite(f, (*list)->object.oid.hash, (int)hash_len); + hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz); } + + return 0; } static const unsigned char *commit_to_sha1(size_t index, void *table) @@ -932,8 +936,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table) return commits[index]->object.oid.hash; } -static void write_graph_chunk_data(struct hashfile *f, int hash_len, - struct write_commit_graph_context *ctx) +static int write_graph_chunk_data(struct hashfile *f, + struct write_commit_graph_context *ctx) { struct commit **list = ctx->commits.list; struct commit **last = ctx->commits.list + ctx->commits.nr; @@ -950,7 +954,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, die(_("unable to parse commit %s"), oid_to_hex(&(*list)->object.oid)); tree = get_commit_tree_oid(*list); - hashwrite(f, tree->hash, hash_len); + hashwrite(f, tree->hash, the_hash_algo->rawsz); parent = (*list)->parents; @@ -1030,10 +1034,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, list++; } + + return 0; } -static void write_graph_chunk_extra_edges(struct hashfile *f, - struct write_commit_graph_context *ctx) +static int write_graph_chunk_extra_edges(struct hashfile *f, + struct write_commit_graph_context *ctx) { struct commit **list = ctx->commits.list; struct commit **last = ctx->commits.list + ctx->commits.nr; @@ -1082,10 +1088,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f, list++; } + + return 0; } -static void write_graph_chunk_bloom_indexes(struct hashfile *f, - struct write_commit_graph_context *ctx) +static int write_graph_chunk_bloom_indexes(struct hashfile *f, + struct write_commit_graph_context *ctx) { struct commit **list = ctx->commits.list; struct commit **last = ctx->commits.list + ctx->commits.nr; @@ -1107,6 +1115,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, } stop_progress(&progress); + return 0; } static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx) @@ -1124,8 +1133,8 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx) jw_release(&jw); } -static void write_graph_chunk_bloom_data(struct hashfile *f, - struct write_commit_graph_context *ctx) +static int write_graph_chunk_bloom_data(struct hashfile *f, + struct write_commit_graph_context *ctx) { struct commit **list = ctx->commits.list; struct commit **last = ctx->commits.list + ctx->commits.nr; @@ -1151,6 +1160,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, } stop_progress(&progress); + return 0; } static int oid_compare(const void *_a, const void *_b) @@ -1662,8 +1672,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) num_chunks * ctx->commits.nr); } write_graph_chunk_fanout(f, ctx); - write_graph_chunk_oids(f, hashsz, ctx); - write_graph_chunk_data(f, hashsz, ctx); + write_graph_chunk_oids(f, ctx); + write_graph_chunk_data(f, ctx); if (ctx->num_extra_edges) write_graph_chunk_extra_edges(f, ctx); if (ctx->changed_paths) { From patchwork Tue Jun 23 17:47:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621277 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BEE0614F6 for ; Tue, 23 Jun 2020 17:47:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A803420774 for ; Tue, 23 Jun 2020 17:47:23 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="rUTMCrGP" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733276AbgFWRrW (ORCPT ); Tue, 23 Jun 2020 13:47:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56866 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733231AbgFWRrS (ORCPT ); Tue, 23 Jun 2020 13:47:18 -0400 Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com [IPv6:2a00:1450:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5FE3FC061573 for ; Tue, 23 Jun 2020 10:47:18 -0700 (PDT) Received: by mail-wr1-x443.google.com with SMTP id g18so12411558wrm.2 for ; Tue, 23 Jun 2020 10:47:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=3lTgw40VlzHMU5pcjBu2me3VuAGVwZ3m4hYCYgAyiL0=; b=rUTMCrGPCPYD4nK88ILvz1RsAfDduXN8WcAwh0RLFrqanvQH/dF2OagZnMYRIEfsiF WXTqTnEyZYyqFPevcIRngWBBgKvTRflQG+sm49jTRBTz6oZOYFblItnS95+BJU5SVWgZ q4mzZEugPFe27dBsYIHMo62UQN9rMtof41VF6EtblC0h66TzKL2uANpJY0QgCJ1+N5M+ 3TP3T9NHNWh2gohY523NfXoAD3BTuJqT404Rgh02RSFbPR7sEdq+iGL8rYv8yuutDfox CI2RusISwpu8PLmvoeu4ziM80SstVTSvmdpCqxHXfcrCf3JTPx/HHr0Q+9PKbZCWIlxN 0k1g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=3lTgw40VlzHMU5pcjBu2me3VuAGVwZ3m4hYCYgAyiL0=; b=thWlK5El84eq/WzzQg+JzFIwKWo33MeeKO/4snbRJvE7v+Y3kuRdjBhFO0sdz1MJHT d3/rVRWgN4aH0bmHVdzAUhYIbjd3Aacdb1fqUD+5BttGvZxSFOrd8u0ORPiC6/M8JMS/ TkgefkiN/RwneezuW+LDy+utpNkOcVUBOpk7RisZIsVqutYayJFtStOeQhygNQltDA/I ctax+3K9+O6+3k7FTtMRCJ5d1cIzHSben5NeP3wxSAXJI07q6mepgvFJubQ3d+g+r8lZ fthXvCleYr2c4BV1IZiEjN/wafwDV7+2LjlajQo0Vj/bYlead7TL1rpcTpimxisBObSf O2nA== X-Gm-Message-State: AOAM530j8T8rg+yEswXy7SH1gQNzvvMPMGd5Xnpe9pTKhpugLoPwmbY3 veYY2YGV+PAO/ZFsCEaIaxX8O0ks X-Google-Smtp-Source: ABdhPJwFKxUZ3fc4/wBpmlYtb8yizTQK3HSpcWxCKXl53Nm46HmFJSvKI64/RP8pI6xlMo+32npV4Q== X-Received: by 2002:a5d:6651:: with SMTP id f17mr13167030wrw.29.1592934437015; Tue, 23 Jun 2020 10:47:17 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id j6sm5111830wmb.3.2020.06.23.10.47.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:16 -0700 (PDT) Message-Id: <8b959f2f374654aeb87b847560761890c2f9aa2c.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: " =?utf-8?q?SZEDER_G=C3=A1bor?= via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:05 +0000 Subject: [PATCH v2 06/11] commit-graph: simplify chunk writes into loop MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , =?utf-8?q?SZEDER_G=C3=A1bor?= Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= In write_commit_graph_file() we now have one block of code filling the array of 'struct chunk_info' with the IDs and sizes of chunks to be written, and an other block of code calling the functions responsible for writing individual chunks. In case of optional chunks like Extra Edge List an Base Graphs List there is also a condition checking whether that chunk is necessary/desired, and that same condition is repeated in both blocks of code. Other, newer chunks have similar optional conditions. Eliminate these repeated conditions by storing the function pointers responsible for writing individual chunks in the 'struct chunk_info' array as well, and calling them in a loop to write the commit-graph file. This will open up the possibility for a bit of foolproofing in the following patch. Signed-off-by: SZEDER Gábor Signed-off-by: Derrick Stolee --- commit-graph.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index f33bfe49b3..086fc2d070 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1555,9 +1555,13 @@ static int write_graph_chunk_base(struct hashfile *f, return 0; } +typedef int (*chunk_write_fn)(struct hashfile *f, + struct write_commit_graph_context *ctx); + struct chunk_info { uint32_t id; uint64_t size; + chunk_write_fn write_fn; }; static int write_commit_graph_file(struct write_commit_graph_context *ctx) @@ -1615,27 +1619,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) chunks[0].id = GRAPH_CHUNKID_OIDFANOUT; chunks[0].size = GRAPH_FANOUT_SIZE; + chunks[0].write_fn = write_graph_chunk_fanout; chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP; chunks[1].size = hashsz * ctx->commits.nr; + chunks[1].write_fn = write_graph_chunk_oids; chunks[2].id = GRAPH_CHUNKID_DATA; chunks[2].size = (hashsz + 16) * ctx->commits.nr; + chunks[2].write_fn = write_graph_chunk_data; if (ctx->num_extra_edges) { chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES; chunks[num_chunks].size = 4 * ctx->num_extra_edges; + chunks[num_chunks].write_fn = write_graph_chunk_extra_edges; num_chunks++; } if (ctx->changed_paths) { chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES; chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr; + chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes; num_chunks++; chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA; chunks[num_chunks].size = sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size; + chunks[num_chunks].write_fn = write_graph_chunk_bloom_data; num_chunks++; } if (ctx->num_commit_graphs_after > 1) { chunks[num_chunks].id = GRAPH_CHUNKID_BASE; chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1); + chunks[num_chunks].write_fn = write_graph_chunk_base; num_chunks++; } @@ -1671,19 +1682,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) progress_title.buf, num_chunks * ctx->commits.nr); } - write_graph_chunk_fanout(f, ctx); - write_graph_chunk_oids(f, ctx); - write_graph_chunk_data(f, ctx); - if (ctx->num_extra_edges) - write_graph_chunk_extra_edges(f, ctx); - if (ctx->changed_paths) { - write_graph_chunk_bloom_indexes(f, ctx); - write_graph_chunk_bloom_data(f, ctx); - } - if (ctx->num_commit_graphs_after > 1 && - write_graph_chunk_base(f, ctx)) { - return -1; + + for (i = 0; i < num_chunks; i++) { + if (chunks[i].write_fn(f, ctx)) { + error(_("failed writing chunk with id %"PRIx32""), + chunks[i].id); + return -1; + } } + stop_progress(&ctx->progress); strbuf_release(&progress_title); From patchwork Tue Jun 23 17:47:06 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621287 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 42A2892A for ; Tue, 23 Jun 2020 17:47:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2B070206D4 for ; Tue, 23 Jun 2020 17:47:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="lABiXFKK" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733212AbgFWRra (ORCPT ); Tue, 23 Jun 2020 13:47:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56872 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733242AbgFWRrT (ORCPT ); Tue, 23 Jun 2020 13:47:19 -0400 Received: from mail-wm1-x344.google.com (mail-wm1-x344.google.com [IPv6:2a00:1450:4864:20::344]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 38E24C061573 for ; Tue, 23 Jun 2020 10:47:19 -0700 (PDT) Received: by mail-wm1-x344.google.com with SMTP id a6so2737704wmm.0 for ; Tue, 23 Jun 2020 10:47:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=CGtwHe9y0z0MVNXHg1zh9emF5kH93HRl2YP/jm/T5uc=; b=lABiXFKKkfwACuvb0lhWnn+3lI7m4DFkzdnTgzi+WpDsejAxMahsgtKZ0HIfJnHmrF WWT/qVeFZUIuMhOW1ac7VX2c9f0xOiWR5yLfK5OBhI+YncmPvd6ArCTK938C5Xucu6Uz b5cK5HSWsBOv6UpfKeifzoHC1V5p6UM/sT04UgiGcWeEoXPH1J4vSoX5LL8egV9GkEf4 HjzNuJ8dn12GSj90mMK0mGW8wyGRtkAY5zsHGCNTNlq+045Kxqw/fGjqJtfDBysvuHt7 1dpPcW4CTsSXZOpnre9DlAQiiX7EdMIlOzFbGNeaBn5WYJyJF8MvH6LeDPmVjPol/owU A6fQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=CGtwHe9y0z0MVNXHg1zh9emF5kH93HRl2YP/jm/T5uc=; b=WUPmBGBRDaAGbL8j8R4j6ktkRvZDW6IV+s4Rc+00R2ujIs0KqxMoflNWBjgy5USO6T Ud71JHbQzgEQjF7wnLhmLYIYwnZSptb0rhKKCOqQOao1l5BTHpNqwN889hoBEVJUKI+8 K7j5ZzYCFhy905KQgAnSWz+R39mYUWJS3nDGSvPzKFAptkqkqvqhINGXrsSr7uYF4Yu+ NjKVWphT3+jW3OUgtbpv1i63nEaZPwSXBOTxOIWMrbdoT9CSUskStLkdISX17RE/KLxj IsHAP1R60t8thgdx39ns2P2Fjh3MoXDIHuY44a42jAkr3Ii0GVeUTF0Zw/9ErEDVqrYn un4A== X-Gm-Message-State: AOAM5308rZHyRUFqEJueADUUyNphbVujOodIeraw17lY6Of6B386pUCg yWfk4uNfCU1ohio7QugvPlJD7vxg X-Google-Smtp-Source: ABdhPJyzXXjOVD4nbwV44xVGCovRpVGlZBlrKT6PCFkwwqz+bgv8cfsDcffPXqx2a33SwKA37n4AsA== X-Received: by 2002:a1c:a512:: with SMTP id o18mr15604527wme.101.1592934437806; Tue, 23 Jun 2020 10:47:17 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id u65sm4899873wmg.5.2020.06.23.10.47.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:17 -0700 (PDT) Message-Id: <3eb10933dc8517ab01eff17a179e484068d29724.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: " =?utf-8?q?SZEDER_G=C3=A1bor?= via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:06 +0000 Subject: [PATCH v2 07/11] commit-graph: check chunk sizes after writing MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , =?utf-8?q?SZEDER_G=C3=A1bor?= Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= In my experience while experimenting with new commit-graph chunks, early versions of the corresponding new write_commit_graph_my_chunk() functions are, sadly but not surprisingly, often buggy, and write more or less data than they are supposed to, especially if the chunk size is not directly proportional to the number of commits. This then causes all kinds of issues when reading such a bogus commit-graph file, raising the question of whether the writing or the reading part happens to be buggy this time. Let's catch such issues early, already when writing the commit-graph file, and check that each write_graph_chunk_*() function wrote the amount of data that it was expected to, and what has been encoded in the Chunk Lookup table. Now that all commit-graph chunks are written in a loop we can do this check in a single place for all chunks, and any chunks added in the future will get checked as well. Signed-off-by: SZEDER Gábor Signed-off-by: Derrick Stolee --- commit-graph.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 086fc2d070..1de6800d74 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1683,12 +1683,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) num_chunks * ctx->commits.nr); } + chunk_offset = f->total + f->offset; for (i = 0; i < num_chunks; i++) { + uint64_t end_offset; + if (chunks[i].write_fn(f, ctx)) { error(_("failed writing chunk with id %"PRIx32""), chunks[i].id); return -1; } + + end_offset = f->total + f->offset; + if (end_offset - chunk_offset != chunks[i].size) + BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead", + chunks[i].size, chunks[i].id, end_offset - chunk_offset); + chunk_offset = end_offset; } stop_progress(&ctx->progress); From patchwork Tue Jun 23 17:47:07 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621279 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 60F0260D for ; Tue, 23 Jun 2020 17:47:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 440B420774 for ; Tue, 23 Jun 2020 17:47:25 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="DfvhXFwA" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733306AbgFWRrY (ORCPT ); Tue, 23 Jun 2020 13:47:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56876 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733253AbgFWRrU (ORCPT ); Tue, 23 Jun 2020 13:47:20 -0400 Received: from mail-wm1-x341.google.com (mail-wm1-x341.google.com [IPv6:2a00:1450:4864:20::341]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E9F7BC061755 for ; Tue, 23 Jun 2020 10:47:19 -0700 (PDT) Received: by mail-wm1-x341.google.com with SMTP id q15so2122967wmj.2 for ; Tue, 23 Jun 2020 10:47:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=c/zMv9Pl8Px8pPuhdc4kPadMp54xcVAZZhck8gDDyHc=; b=DfvhXFwA6LlwvnSTt72akRIxC/5hOOc+rDOEFhog3SDKLPgN+VtreedhFN0zCJUSDM y2rEFh9dzS8hvXywEzIgepmvBac8sE5QXYNyR1R3/P/3bDxBoP18PU09V4hfowDjMCuW jB+8jfySxyF2nKMo+N/U2U++EjtYgiByqKHuLrn8UG7x0E8zv1je8lIIZUrTcl3d/zH+ 2sl8a+6STnVn2OPDDai20DigNXDJQbKS3T93O9U6MeAizyB0xBqREgVUd7KxcuAl12OG JhcP1D/jPwyIF9+bchSamQgBZ91hr2bpu+gDoF4o1ytZomLTs5enryUin38HuwNUfTVJ jHBw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=c/zMv9Pl8Px8pPuhdc4kPadMp54xcVAZZhck8gDDyHc=; b=Lah5eyUaE/d3c3l3VkUlKvZh/C5z46+jmxbznep9JZ4beHa03nPEkfszXi1y3P+iC3 BVq2BaZPdvpumQFf7Th7xQcEg9LuLiD7fWvoL7joC9WrCKHqDWeaqUlGziKP3nMWkDI9 lrDfXQo3pWXC8J0t+c9OqRcY3RHmSarDk+KzOUlMqxR8VLfCqrL8N8m5U/r93mmVPSWM knurL0gPrvoi24PmiPhfJLPUioH+JxLYY+iR/DVOecM3yvW9lx9mKnBU/iC72Hy4iGUT Jem/N9jKJHKzFIUPBUjbgrhoqyxHEWJIAAQ1lzQDs8qQC5RWcdjtMN65kDjkUI+RjmZb nmUg== X-Gm-Message-State: AOAM533RgurmKgQptKUe7AQQPQCL7Vo3nx+3pfjQ2HEJvqgJyeEQ/wEv g9OIZylEKo+RqwBYO3V326erchBs X-Google-Smtp-Source: ABdhPJytEh+RO3enMT8HoWLbiIHqVLF24kmw1gDqBEO0OSFk3IZ73fWedaMUI7BoYriLbFX5FSgVSA== X-Received: by 2002:a1c:2e58:: with SMTP id u85mr24970864wmu.123.1592934438583; Tue, 23 Jun 2020 10:47:18 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id p16sm25571185wru.27.2020.06.23.10.47.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:18 -0700 (PDT) Message-Id: <0bcfc1f05126767ab6d6125a2b04d6d7392eb519.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:07 +0000 Subject: [PATCH v2 08/11] revision.c: fix whitespace Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , Derrick Stolee Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee Here, four spaces were used instead of tab characters. Reported-by: Taylor Blau Signed-off-by: Derrick Stolee --- revision.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/revision.c b/revision.c index c644c66091..ed59084f50 100644 --- a/revision.c +++ b/revision.c @@ -697,11 +697,11 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) /* remove single trailing slash from path, if needed */ if (pi->match[last_index] == '/') { - path_alloc = xstrdup(pi->match); - path_alloc[last_index] = '\0'; - path = path_alloc; + path_alloc = xstrdup(pi->match); + path_alloc[last_index] = '\0'; + path = path_alloc; } else - path = pi->match; + path = pi->match; len = strlen(path); From patchwork Tue Jun 23 17:47:08 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621281 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AD7DA60D for ; Tue, 23 Jun 2020 17:47:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 94FDC20781 for ; Tue, 23 Jun 2020 17:47:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="EnYNLE2w" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733289AbgFWRrZ (ORCPT ); Tue, 23 Jun 2020 13:47:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56884 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733273AbgFWRrW (ORCPT ); Tue, 23 Jun 2020 13:47:22 -0400 Received: from mail-wr1-x42a.google.com (mail-wr1-x42a.google.com [IPv6:2a00:1450:4864:20::42a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EB6C1C061796 for ; Tue, 23 Jun 2020 10:47:21 -0700 (PDT) Received: by mail-wr1-x42a.google.com with SMTP id l10so21456584wrr.10 for ; Tue, 23 Jun 2020 10:47:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=ylXudJhJHctFwahixARVJ60X1aBkrBQE5e/sNafWwb4=; b=EnYNLE2wqgCA8QZeWUUf+HzR6bmMOQLTyfVeaKr0lryL8SaTLw2NVBo2gk+70HYNEt tTms/5zrBU6RyXANd8tjB9hey6ffApkJk0D63F0rdR04Sqq43xth7a7Owede61T5YWpK bNRR1K6lROHg+3jpD2z8UI0Zc4Mnv0JLPpklsrbbSM25UYAyZegTNLvqqHVUc98cP+bG Lb4gSTe7/gMpJjunlBZqMTjEvFN5Y+l0u30LCntVY3TzZopfd+P6K67sFOx5dRHsJpRE U4pa1gPPGMm80drB9RFqV8oS6/gq5RIOHN5a0VdPpQVn8joQkK6tfVdfr3knD1OY6JNR aFEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=ylXudJhJHctFwahixARVJ60X1aBkrBQE5e/sNafWwb4=; b=rRFF1PN2cy5bwGaYq+zWd+GGu62lJKtH5N5VfRh8tMZ5fx5V97vwK+5jfrcjYRHaHc dXFxIaBMJ0tRoL9Y6zomEtuuW0/6XoqgyvsnjfSOU52YFskYft//L9dXQcPngboPFf5f tV3zCMyP5IUvyEMJOGAfBzG1qj99TCOUbyVQjRkJuGXo8oueZjEHL3DFlC+22UXid0VQ ifWhXe4JiNsIwYFavc8N+JPMUQVp7yt94oCWMHsnsU6fwRsJPEMucaSUlztpus2JnJ4e ByrQPOAlcebobnJmYV2oxsZOjB6EtOdj2LG5eXXrmXb1DfV9x2QW8gmyi84OInR44stO 5+eg== X-Gm-Message-State: AOAM532wlQk5zu0qvFfP88vrUdVlu87F8egRJUvbcQl+fBWjrnNMdKNk rW/6oI08BmQ8XfyZlFAStnDrBDqY X-Google-Smtp-Source: ABdhPJxgUQdOfGFw5jXUjAeInxG+Uulk1hhtJ+5sGE7VuEs5dP7rrG/QyRQRPmN2/Xf1JLQBZ2yBmg== X-Received: by 2002:adf:e6c8:: with SMTP id y8mr27953434wrm.40.1592934439352; Tue, 23 Jun 2020 10:47:19 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id z2sm5057318wmc.2.2020.06.23.10.47.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:18 -0700 (PDT) Message-Id: <719c7091a7a48434ce7534c5b617f4238d96e22d.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Taylor Blau via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:08 +0000 Subject: [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , Taylor Blau Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Taylor Blau The prepare_to_use_bloom_filter() method was not intended to be called on an empty pathspec. However, 'git log -- .' and 'git log' are subtly different: the latter reports all commits while the former will simplify commits that do not change the root tree. This means that the path used to construct the bloom_key might be empty, and that value is not added to the Bloom filter during construction. That means that the results are likely incorrect! To resolve the issue, be careful about the length of the path and stop filling Bloom filters. To be completely sure we do not use them, drop the pointer to the bloom_filter_settings from the commit-graph. That allows our test to look at the trace2 logs to verify no Bloom filter statistics are reported. Signed-off-by: Taylor Blau Signed-off-by: Derrick Stolee --- revision.c | 4 ++++ t/t4216-log-bloom.sh | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/revision.c b/revision.c index ed59084f50..b53377cd52 100644 --- a/revision.c +++ b/revision.c @@ -704,6 +704,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) path = pi->match; len = strlen(path); + if (!len) { + revs->bloom_filter_settings = NULL; + return; + } revs->bloom_key = xmalloc(sizeof(struct bloom_key)); fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index 426de10041..f890cc4737 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -112,6 +112,10 @@ test_expect_success 'git log -- multiple path specs does not use Bloom filters' test_bloom_filters_not_used "-- file4 A/file1" ' +test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' ' + test_bloom_filters_not_used "-- ." +' + test_expect_success 'git log with wildcard that resolves to a single path uses Bloom filters' ' test_bloom_filters_used "-- *4" && test_bloom_filters_used "-- *renamed" From patchwork Tue Jun 23 17:47:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621283 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D7F3B60D for ; Tue, 23 Jun 2020 17:47:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AE73A20781 for ; Tue, 23 Jun 2020 17:47:28 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="H0yddci6" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387411AbgFWRr1 (ORCPT ); Tue, 23 Jun 2020 13:47:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56882 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733167AbgFWRrW (ORCPT ); Tue, 23 Jun 2020 13:47:22 -0400 Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com [IPv6:2a00:1450:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 92052C061795 for ; Tue, 23 Jun 2020 10:47:21 -0700 (PDT) Received: by mail-wr1-x444.google.com with SMTP id b6so21445007wrs.11 for ; Tue, 23 Jun 2020 10:47:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=/GiuafEO77xTxqJ2xUQlt56s9yw/zLOVIZLgSZpt8aE=; b=H0yddci6AuVZDiN2LWjMSmmv8ehXlQ5lDRhXVohwHjehx+3k5bwb2qwTNs8AdhFAAJ a2cwXjKS9chl713S0bFD/lud0jxiN+FOkhMGtNM1f10W8iUKOiZpkepJLA1XGiHIx4nX LrIsDey0vnaUYlh+wyM1qEZogczbf7Ej9hPBDLKk6w7FOtZyLuwDPHRtApZDYrxd7UrM xryThzQsiUpqaKakRvOjG/7+bqfuGpMyiNnhrmwXsZP7iAnYxaqgCTL7Qx/Ab6vW1KkW 3Mu7rdOvoU3VOcm5l7+RPwR7i6J/WRGXcosGAAhtoPYTmJwjlj0VC12Sg6T8z4ehVCTS 8GkQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=/GiuafEO77xTxqJ2xUQlt56s9yw/zLOVIZLgSZpt8aE=; b=SjdNBI0U3cKCjvpfYoSvX578gNkrP2Inh6EvZ0FiVhCR4TZ1Gx35gzPFVQjis4VLI6 BvMnA6HHkYOeQIZ0fqFh3IpRizd2Rj0TRV9zdO8As64lEI/fHAZ3915kWT8fz8DDIhAx 47yftw4/C5fT6Kmn8nsv53gb2+ivwBy9whJLDIr9vgtvmhWPjEg91TCXDu32hpTJHeuA E8b4ImXbdev4FP0KGLm5cMcdogQKkSwHdOb6fB4XQjFHjAu1KNKzMgLtvcmN1j8rQsiQ WznJLA6x3hlUFREa6+MCAHktvjWhb3Cjhr/SjKbfzGqIkpCYPXx++kPtRE9e0MhxKAUZ bw3Q== X-Gm-Message-State: AOAM5314vazLHv61cbJIS2GbKCqL9PRaBoLGHCGf/L+YRaRpQKrKYE4L b1HA5eaMuQ7GPUVaFCt65UIzIhOM X-Google-Smtp-Source: ABdhPJzGmRKmVY9VcppLOttNagHyKCDoMuKxr7Ri0lEesI7YYEFoElK0pilinhBUTZBtGvxUPZgV6Q== X-Received: by 2002:adf:dd83:: with SMTP id x3mr24955208wrl.292.1592934440113; Tue, 23 Jun 2020 10:47:20 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id u10sm4573793wml.29.2020.06.23.10.47.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:19 -0700 (PDT) Message-Id: <9c2076b4ce46918fce8f05e609b057611ec56e13.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: " =?utf-8?q?SZEDER_G=C3=A1bor?= via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:09 +0000 Subject: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , =?utf-8?q?SZEDER_G=C3=A1bor?= Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= The file 'dir/subdir/file' can only be modified if its leading directories 'dir' and 'dir/subdir' are modified as well. So when checking modified path Bloom filters looking for commits modifying a path with multiple path components, then check not only the full path in the Bloom filters, but all its leading directories as well. Take care to check these paths in "deepest first" order, because it's the full path that is least likely to be modified, and the Bloom filter queries can short circuit sooner. This can significantly reduce the average false positive rate, by about an order of magnitude or three(!), and can further speed up pathspec-limited revision walks. The table below compares the average false positive rate and runtime of git rev-list HEAD -- "$path" before and after this change for 5000+ randomly* selected paths from each repository: Average false Average Average positive rate runtime runtime before after before after difference ------------------------------------------------------------------ git 3.220% 0.7853% 0.0558s 0.0387s -30.6% linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% *Path selection was done with the following pipeline: git ls-tree -r --name-only HEAD | sort -R | head -n 5000 The improvements in runtime are much smaller than the improvements in average false positive rate, as we are clearly reaching diminishing returns here. However, all these timings depend on that accessing tree objects is reasonably fast (warm caches). If we had a partial clone and the tree objects had to be fetched from a promisor remote, e.g.: $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ commit-graph write --reachable $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ rev-list HEAD -- "$path" then checking all leading path component can reduce the runtime from over an hour to a few seconds (and this is with the clone and the promisor on the same machine). This adjusts the tracing values in t4216-log-bloom.sh, which provides a concrete way to notice the improvement. Helped-by: Taylor Blau Helped-by: René Scharfe Signed-off-by: SZEDER Gábor Signed-off-by: Derrick Stolee --- revision.c | 41 ++++++++++++++++++++++++++++++++--------- revision.h | 6 ++++-- t/t4216-log-bloom.sh | 2 +- 3 files changed, 37 insertions(+), 12 deletions(-) diff --git a/revision.c b/revision.c index b53377cd52..077888ee51 100644 --- a/revision.c +++ b/revision.c @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) { struct pathspec_item *pi; char *path_alloc = NULL; - const char *path; + const char *path, *p; int last_index; - int len; + size_t len; + int path_component_nr = 1; if (!revs->commits) return; @@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) return; } - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); + p = path; + while (*p) { + if (is_dir_sep(*p)) + path_component_nr++; + p++; + } + + revs->bloom_keys_nr = path_component_nr; + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); + + fill_bloom_key(path, len, &revs->bloom_keys[0], + revs->bloom_filter_settings); + path_component_nr = 1; + + p = path + len - 1; + while (p > path) { + if (is_dir_sep(*p)) + fill_bloom_key(path, p - path, + &revs->bloom_keys[path_component_nr++], + revs->bloom_filter_settings); + p--; + } if (trace2_is_enabled() && !bloom_filter_atexit_registered) { atexit(trace2_bloom_filter_statistics_atexit); @@ -724,7 +745,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, struct commit *commit) { struct bloom_filter *filter; - int result; + int result = 1, j; if (!revs->repo->objects->commit_graph) return -1; @@ -744,9 +765,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, return -1; } - result = bloom_filter_contains(filter, - revs->bloom_key, - revs->bloom_filter_settings); + for (j = 0; result && j < revs->bloom_keys_nr; j++) { + result = bloom_filter_contains(filter, + &revs->bloom_keys[j], + revs->bloom_filter_settings); + } if (result) count_bloom_filter_maybe++; @@ -786,7 +809,7 @@ static int rev_compare_tree(struct rev_info *revs, return REV_TREE_SAME; } - if (revs->bloom_key && !nth_parent) { + if (revs->bloom_keys_nr && !nth_parent) { bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); if (bloom_ret == 0) diff --git a/revision.h b/revision.h index 7c026fe41f..abbfb4ab59 100644 --- a/revision.h +++ b/revision.h @@ -295,8 +295,10 @@ struct rev_info { struct topo_walk_info *topo_walk_info; /* Commit graph bloom filter fields */ - /* The bloom filter key for the pathspec */ - struct bloom_key *bloom_key; + /* The bloom filter key(s) for the pathspec */ + struct bloom_key *bloom_keys; + int bloom_keys_nr; + /* * The bloom filter settings used to generate the key. * This is loaded from the commit-graph being used. diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index f890cc4737..84f95972ca 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' ' test_bloom_filters_used_when_some_filters_are_missing () { log_args=$1 - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6" + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8" setup "$log_args" && grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom From patchwork Tue Jun 23 17:47:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Philippe Blain via GitGitGadget X-Patchwork-Id: 11621285 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0A5C892A for ; Tue, 23 Jun 2020 17:47:31 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E4E7820774 for ; Tue, 23 Jun 2020 17:47:30 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="LXuaXvoQ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387413AbgFWRr2 (ORCPT ); Tue, 23 Jun 2020 13:47:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733291AbgFWRrX (ORCPT ); Tue, 23 Jun 2020 13:47:23 -0400 Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com [IPv6:2a00:1450:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7CC18C061797 for ; Tue, 23 Jun 2020 10:47:22 -0700 (PDT) Received: by mail-wr1-x443.google.com with SMTP id v3so13970126wrc.1 for ; Tue, 23 Jun 2020 10:47:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=lP7hNQKjY2MPQgEG4Aw4NOT5/3I5B5AgUnZQzdjy8Sc=; b=LXuaXvoQ/z7rfB2MjBdzrc+ti7L1uhCqNOildKTCO2nxtUiUkfWf/0kMEWWPp9yODx uC9GND/Rld3JKhTnCAxjwF91EWz5lMrFzJv+kihmCPUmNnb7MS0wH9myvRakPV8e7b8j v0Gnkhe5LGRP7DHyMyDHCk9/im8PvISMp6oSf9A3yloNl2vcXlDJ0gQN0h93NoWJlPjz MXlqYcD/tVdnH+xSEH6hvpZtkX4F07LGzvAomEMqCvPCAVzVTAahYiZkHK+LuLFpY5v8 ikHPfSPcVQPRuy19Kq/EvYlzsteVAcRTNKQour8MESGvB8VPguSYRiOMqDuhyWXzkZIO qkew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=lP7hNQKjY2MPQgEG4Aw4NOT5/3I5B5AgUnZQzdjy8Sc=; b=IOSft3wz5Yaa9VBK4ZaEfR+xPmJAE14AtnQvyv0nK247+v6PMkAW+7ok5PS5Kc7W2H rgoW8irrJaNdS14dEUOYQ3JKdzODMfoJDlmnv9mZMD115osGqvqlC/KCryUC2Em3U6Wv 7kXPZichOilisXs/0YSWQgjTkG4Bhw8Kxc2+urrVX8EQ/851xIKtnysP2MZqRj3LvweQ xtyhDl1L8SqHDDWjessKRxGonsIliTiuIx9PiUChtUEPe6SshuDhTsSm3E+1Eo470Cqr YLqYvthOJgB+bWrgwLvQIahHBNVUBenRo0qvLEvT4936eUz0COq7oAp13RbPykD9ViK0 X7Vw== X-Gm-Message-State: AOAM531YQsJIpPig0rpf2klDLlAJGfJsi1azH552mIWqhZ98C1lOxHE9 N/u2M/LHhf89ND7/AzqHn3KGUKfH X-Google-Smtp-Source: ABdhPJxQdPKsIpiPwdccTrhfu3V7DcdLVzeykzpdzC2sczZY49d3IZ/OqE/ZIS9rjBKKmraJiQxKFw== X-Received: by 2002:adf:f707:: with SMTP id r7mr14190861wrp.70.1592934440981; Tue, 23 Jun 2020 10:47:20 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id i17sm17473120wrc.34.2020.06.23.10.47.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jun 2020 10:47:20 -0700 (PDT) Message-Id: <1022c0ad21379dae825ca8e47b3da8b62c0b59dc.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Tue, 23 Jun 2020 17:47:10 +0000 Subject: [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee , Derrick Stolee Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee The original design of changed-path Bloom filters included an 8-byte block size for filter lengths. This was changed mid-way through the submission process, and now the length stored in the commit-graph has one-byte granularity. This can cause some issues for very small filters. The analysis for false positive rates assume large filters, so rounding errors become less important at that scale. When there are only a few paths changed, a filter that has size only a few bytes could have very different behavior. In fact, this is evidenced in the Git repository due to the code organization and careful patch creation that leads to many commits with very small filters. These small filters frequently have false-positive rates in the 8-10% range or higher. The previous change improved the false-positive rate using multiple Bloom keys when the path has multiple directory components. However, that does not help at all for files at root. It is typical to have several commits that change only the README at root, and those commits would be likely to have these artificially high false-positive rates. Correct this issue by creating a minimum filters size of 8 bytes. This requires the very small commits (with fewer than six changes, including non-root directories) to have a larger filter. In principle, this violates the bits_per_entry value of struct bloom_filter_settings. However, it does not actually create a functional problem. As for compatibility, this only affects new versions writing filters for commits that do not yet have a filter. Old version will write the smaller filters and this version will persist and properly read that data. Now, the new files will be generated slightly larger. Bytes before Bytes after Difference -------------------------------------------------- git 4,021,078 4,275,311 +6.32% linux 72,212,101 73,909,286 +2.35% tensorflow 7,596,359 7,691,646 +1.25% This has a measurable improvement in the false-positive rate and the end-to-end run time for these repos. The table below compares the average false-positive rate and runtime of git rev-list HEAD -- "$path" before and after this change for 5000+ randomly* selected paths from each repository: Average false Average Average positive rate runtime runtime before after before after difference ------------------------------------------------------------------ git 0.786% 0.227% 0.0387s 0.0289s -25.5% linux 0.0296% 0.0174% 0.0766s 0.0706s -7.8% tensorflow 0.6977% 0.0268% 0.0420s 0.0384s -8.5% *Path selection was done with the following pipeline: git ls-tree -r --name-only HEAD | sort -R | head -n 5000 These relatively-small increases in file size appear to be a fair price to pay for these performance improvements. Signed-off-by: Derrick Stolee --- bloom.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/bloom.c b/bloom.c index 7291506564..e9dc15976c 100644 --- a/bloom.c +++ b/bloom.c @@ -257,6 +257,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r, } filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD; + + if (filter->len && filter->len < 8) + filter->len = 8; + filter->data = xcalloc(filter->len, sizeof(unsigned char)); hashmap_for_each_entry(&pathmap, &iter, e, entry) {