From patchwork Mon Apr 6 16:59:41 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475901 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1ABAD1871 for ; Mon, 6 Apr 2020 17:00:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id DECFC21841 for ; Mon, 6 Apr 2020 17:00:02 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="tS9F81Mz" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729724AbgDFRAA (ORCPT ); Mon, 6 Apr 2020 13:00:00 -0400 Received: from mail-ed1-f68.google.com ([209.85.208.68]:33340 "EHLO mail-ed1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729574AbgDFRAA (ORCPT ); Mon, 6 Apr 2020 13:00:00 -0400 Received: by mail-ed1-f68.google.com with SMTP id z65so420402ede.0 for ; Mon, 06 Apr 2020 09:59:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=FQfXpkTVI6kPC3l2uN2S1GBJYi0tfK3nyp95t99f4tM=; b=tS9F81MzUFnPOi1P2Dz+MrEygk36YoBE7VJCvwd1YP39xmFRwdI9pk3tj9TlowCZmD B90k/FMFIt/H0VY8Az8bxT8KqXR6GONNVxlL4Pn/HZRBeoPvKT92+S5XO1E1fifAV1Ky SGlqOOHesC7RQAeeXI9HTh6nHdqo4o+r1EdwMk0MPcl1ianHBpk9PQHgKe3jRIty+4Q1 0aGvbl15tJQomImtDorPcWUxLGYLmeDqx0EuQ3OPpKSNV1gw7CECWGJt3ZEhzoeJ+1KB CforvqjY3B+TzLx7E6eHMyLdVVMwtt4z3AdBn+Ft/0bl1KA3f61WUz3le7Gfevk+5YXD YRiA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=FQfXpkTVI6kPC3l2uN2S1GBJYi0tfK3nyp95t99f4tM=; b=o4WKefCTo/u7cQ+OasK8wREqVAA6mEVbFlfqCw8FWThjoe2PepcYzLIYY8AS2wTWnQ TYyg8Pe2JCJXEJorWdaYxwztjJ6X7VlLgFwabqlYMEcc5MI3UdxFYuCw7pP5viLrKNLP Xxbqlb1+obb4KVtd5whbaYz2msGXZVysfT73/8Y1ZuRKPeQEWVS2mi/A88LuP/Kqowgs yaZB6y59nxBrxC7+1qgYKUHMwRog32RSq8dTNuavRCkfzhHvduPVYQtjoNoPdTqdwaad nIIA5/Dy7Q1fsUFCJtwK3BLLKrzRMH9nD2iPPUwa3/QajOUFUOboMSuKKYUor6ZdSOhW NeDA== X-Gm-Message-State: AGi0Puai4ECxl6nr19+nViZzOBoJZaxY/+ssN7bU3kdqoETvUQycYBiQ PUjSIUTSTf6d8zifdFUGM4JpbzGk X-Google-Smtp-Source: APiQypJXmWHpXjqfPYEWmFm9V8vWs18PbrwgVRq8M8lmG7EdhttG13aYm4Yy4Um0+tBMHfeE710+/A== X-Received: by 2002:a05:6402:44e:: with SMTP id p14mr20507181edw.356.1586192397437; Mon, 06 Apr 2020 09:59:57 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id h14sm3007564ejt.1.2020.04.06.09.59.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 09:59:57 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:41 +0000 Subject: [PATCH v4 01/15] commit-graph: define and use MAX_NUM_CHUNKS MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh This is a minor cleanup to make it easier to change the number of chunks being written to the commit graph. Reviewed-by: Jakub Narębski Signed-off-by: Garima Singh --- commit-graph.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index f013a84e294..e4f1a5b2f1a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -23,6 +23,7 @@ #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */ #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */ +#define MAX_NUM_CHUNKS 5 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16) @@ -1350,8 +1351,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) int fd; struct hashfile *f; struct lock_file lk = LOCK_INIT; - uint32_t chunk_ids[6]; - uint64_t chunk_offsets[6]; + uint32_t chunk_ids[MAX_NUM_CHUNKS + 1]; + uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1]; const unsigned hashsz = the_hash_algo->rawsz; struct strbuf progress_title = STRBUF_INIT; int num_chunks = 3; From patchwork Mon Apr 6 16:59:42 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475905 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6BEAB92C for ; Mon, 6 Apr 2020 17:00:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3F716224F9 for ; Mon, 6 Apr 2020 17:00:04 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="tXDSgnyo" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729738AbgDFRAD (ORCPT ); Mon, 6 Apr 2020 13:00:03 -0400 Received: from mail-ed1-f50.google.com ([209.85.208.50]:34284 "EHLO mail-ed1-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729534AbgDFRAB (ORCPT ); Mon, 6 Apr 2020 13:00:01 -0400 Received: by mail-ed1-f50.google.com with SMTP id o1so412660edv.1 for ; Mon, 06 Apr 2020 09:59:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=RQ13dTRqXt/tBsLYEWoAP2r9PckPNmWwxMgCAFExKr0=; b=tXDSgnyoORwFcBmjaKKfAoL/mfaGszAEnXbdj095GhWuM6U19gEi8bPMvZo47PkZ/M D0hzNsgDDe+jCRlxu7ikyKJI+i19eRgqmg6mEZoDlqia/M6WVjd746pGX7jt997RzRWa lZsNARt5D4vMGRv2zqhCLduW3L2BR48J2Ufu2WxBo/m9gr0fHw2oGp1KFz9rv6jqHfsH LGRvwhBUqz9SwhBJUtVA4RNOfgN/aitA0DHISOwPdp2siLCxQemQ0OCvIATlNgyVe9ac n1nQNlnlHdNCuJ1bchJDAGp/kYXXh1F8c9T8lygq6/k7CxVZ193EYttqNRUfbDEjt727 RYow== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=RQ13dTRqXt/tBsLYEWoAP2r9PckPNmWwxMgCAFExKr0=; b=IpxZM0bCllrxJwp/b9LLKa4r285CZujHJCV5v+8bS8qjyNNbn8++zR8aGDrVgRYIZ3 A8UH8S6moSw4PgWz77LdHZwKzMifFqykWC68Nq47Gfu/tpkLri76b3xZPwy+yxbFnbsQ qy63p+4SVM658pb6zSkdx3ge9dNKDJUXkZlJ24ZViz+0BsV6JTDXjOV4JxCDm2RLHJmx HTPMu3ZBTyCadUCUzblW7yLgrOTKymyHh9LzhdAy5dCawUtayHjRvrB0x7roE7QUeBsR e2QohvE+gkfXbVTTHH2ftv8Iw+jSOFrmhN6g85qgclIv7nO1OZbZZVdG8ZBTzHYObRBD PVPQ== X-Gm-Message-State: AGi0PuZR0eo+pPq7g+jH931Y3lBwzwjUVmIsOJCTP8DrJYLC+baGAAnI R9hErBagNN+4PmDhWUR0FzbHn6OJ X-Google-Smtp-Source: APiQypKHcvlDzeTNur9kbSikslo4ws1TCk5fQO2/BVD1qhbSukOZA5gc1zRDLo9lBxoH0Jz6LoROTA== X-Received: by 2002:a50:8ad2:: with SMTP id k18mr20756791edk.354.1586192398207; Mon, 06 Apr 2020 09:59:58 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id i19sm3020089ejh.58.2020.04.06.09.59.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 09:59:57 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:42 +0000 Subject: [PATCH v4 02/15] bloom.c: add the murmur3 hash implementation MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh In preparation for computing changed paths Bloom filters, implement the Murmur3 hash algorithm as described in [1]. It hashes the given data using the given seed and produces a uniformly distributed hash value. [1] https://en.wikipedia.org/wiki/MurmurHash#Algorithm Helped-by: Derrick Stolee Helped-by: Szeder Gábor Reviewed-by: Jakub Narębski Signed-off-by: Garima Singh --- Makefile | 2 ++ bloom.c | 73 +++++++++++++++++++++++++++++++++++++++++++ bloom.h | 13 ++++++++ t/helper/test-bloom.c | 13 ++++++++ t/helper/test-tool.c | 1 + t/helper/test-tool.h | 1 + t/t0095-bloom.sh | 30 ++++++++++++++++++ 7 files changed, 133 insertions(+) create mode 100644 bloom.c create mode 100644 bloom.h create mode 100644 t/helper/test-bloom.c create mode 100755 t/t0095-bloom.sh diff --git a/Makefile b/Makefile index ef1ff2228f0..491f75e68c5 100644 --- a/Makefile +++ b/Makefile @@ -695,6 +695,7 @@ X = PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS)) TEST_BUILTINS_OBJS += test-advise.o +TEST_BUILTINS_OBJS += test-bloom.o TEST_BUILTINS_OBJS += test-chmtime.o TEST_BUILTINS_OBJS += test-config.o TEST_BUILTINS_OBJS += test-ctype.o @@ -840,6 +841,7 @@ LIB_OBJS += base85.o LIB_OBJS += bisect.o LIB_OBJS += blame.o LIB_OBJS += blob.o +LIB_OBJS += bloom.o LIB_OBJS += branch.o LIB_OBJS += bulk-checkin.o LIB_OBJS += bundle.o diff --git a/bloom.c b/bloom.c new file mode 100644 index 00000000000..40e87632aeb --- /dev/null +++ b/bloom.c @@ -0,0 +1,73 @@ +#include "git-compat-util.h" +#include "bloom.h" + +static uint32_t rotate_left(uint32_t value, int32_t count) +{ + uint32_t mask = 8 * sizeof(uint32_t) - 1; + count &= mask; + return ((value << count) | (value >> ((-count) & mask))); +} + +/* + * Calculate the murmur3 32-bit hash value for the given data + * using the given seed. + * Produces a uniformly distributed hash value. + * Not considered to be cryptographically secure. + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm + */ +uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) +{ + const uint32_t c1 = 0xcc9e2d51; + const uint32_t c2 = 0x1b873593; + const uint32_t r1 = 15; + const uint32_t r2 = 13; + const uint32_t m = 5; + const uint32_t n = 0xe6546b64; + int i; + uint32_t k1 = 0; + const char *tail; + + int len4 = len / sizeof(uint32_t); + + uint32_t k; + for (i = 0; i < len4; i++) { + uint32_t byte1 = (uint32_t)data[4*i]; + uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8; + uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16; + uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24; + k = byte1 | byte2 | byte3 | byte4; + k *= c1; + k = rotate_left(k, r1); + k *= c2; + + seed ^= k; + seed = rotate_left(seed, r2) * m + n; + } + + tail = (data + len4 * sizeof(uint32_t)); + + switch (len & (sizeof(uint32_t) - 1)) { + case 3: + k1 ^= ((uint32_t)tail[2]) << 16; + /*-fallthrough*/ + case 2: + k1 ^= ((uint32_t)tail[1]) << 8; + /*-fallthrough*/ + case 1: + k1 ^= ((uint32_t)tail[0]) << 0; + k1 *= c1; + k1 = rotate_left(k1, r1); + k1 *= c2; + seed ^= k1; + break; + } + + seed ^= (uint32_t)len; + seed ^= (seed >> 16); + seed *= 0x85ebca6b; + seed ^= (seed >> 13); + seed *= 0xc2b2ae35; + seed ^= (seed >> 16); + + return seed; +} \ No newline at end of file diff --git a/bloom.h b/bloom.h new file mode 100644 index 00000000000..d0fcc5f0aa6 --- /dev/null +++ b/bloom.h @@ -0,0 +1,13 @@ +#ifndef BLOOM_H +#define BLOOM_H + +/* + * Calculate the murmur3 32-bit hash value for the given data + * using the given seed. + * Produces a uniformly distributed hash value. + * Not considered to be cryptographically secure. + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm + */ +uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len); + +#endif \ No newline at end of file diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c new file mode 100644 index 00000000000..60ee2043689 --- /dev/null +++ b/t/helper/test-bloom.c @@ -0,0 +1,13 @@ +#include "git-compat-util.h" +#include "bloom.h" +#include "test-tool.h" + +int cmd__bloom(int argc, const char **argv) +{ + if (!strcmp(argv[1], "get_murmur3")) { + uint32_t hashed = murmur3_seeded(0, argv[2], strlen(argv[2])); + printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); + } + + return 0; +} \ No newline at end of file diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index 31eedcd241f..6e26bd65c97 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -15,6 +15,7 @@ struct test_cmd { static struct test_cmd cmds[] = { { "advise", cmd__advise_if_enabled }, + { "bloom", cmd__bloom }, { "chmtime", cmd__chmtime }, { "config", cmd__config }, { "ctype", cmd__ctype }, diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h index 4eb5e6609e1..dceeef1d5c2 100644 --- a/t/helper/test-tool.h +++ b/t/helper/test-tool.h @@ -5,6 +5,7 @@ #include "git-compat-util.h" int cmd__advise_if_enabled(int argc, const char **argv); +int cmd__bloom(int argc, const char **argv); int cmd__chmtime(int argc, const char **argv); int cmd__config(int argc, const char **argv); int cmd__ctype(int argc, const char **argv); diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh new file mode 100755 index 00000000000..2dad8c4a94e --- /dev/null +++ b/t/t0095-bloom.sh @@ -0,0 +1,30 @@ +#!/bin/sh + +test_description='Testing the various Bloom filter computations in bloom.c' +. ./test-lib.sh + +test_expect_success 'compute unseeded murmur3 hash for empty string' ' + cat >expect <<-\EOF && + Murmur3 Hash with seed=0:0x00000000 + EOF + test-tool bloom get_murmur3 "" >actual && + test_cmp expect actual +' + +test_expect_success 'compute unseeded murmur3 hash for test string 1' ' + cat >expect <<-\EOF && + Murmur3 Hash with seed=0:0x627b0c2c + EOF + test-tool bloom get_murmur3 "Hello world!" >actual && + test_cmp expect actual +' + +test_expect_success 'compute unseeded murmur3 hash for test string 2' ' + cat >expect <<-\EOF && + Murmur3 Hash with seed=0:0x2e4ff723 + EOF + test-tool bloom get_murmur3 "The quick brown fox jumps over the lazy dog" >actual && + test_cmp expect actual +' + +test_done \ No newline at end of file From patchwork Mon Apr 6 16:59:43 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475903 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A5D5918C6 for ; Mon, 6 Apr 2020 17:00:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 7A024233EB for ; Mon, 6 Apr 2020 17:00:03 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="VxL02eib" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729734AbgDFRAD (ORCPT ); Mon, 6 Apr 2020 13:00:03 -0400 Received: from mail-ed1-f68.google.com ([209.85.208.68]:36417 "EHLO mail-ed1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729703AbgDFRAC (ORCPT ); Mon, 6 Apr 2020 13:00:02 -0400 Received: by mail-ed1-f68.google.com with SMTP id i7so396314edq.3 for ; Mon, 06 Apr 2020 10:00:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=vU8KsqMF4jrOIHiDLFv6tiZ3pZ9gGp87uv6xrWTW7Dg=; b=VxL02eib0URGzGFpJ1jE2iBie8I1QG21FScSWPbTo8WgDT9WcTBlc+fQJlyQ79+0zf ZfcNtql+nWyaBDl1E9muPcoG/Lgu+8KJ7HVp3mA33ZmoWhiilrpOG6GZcN8agW7r590J KAQujuTXOGKQufk4FcB41xzE7Y81meV4FcuEfZQgL7Gvf93DOCVhccUybwpfIswFjVQC gSvINCqCt51ODfuvP2mGLGcPxZ6kwSxdAEYtowjw1AaiUN8D25qYctPnRAOrfrmVtqHn epMHFOJM7c7FpSswqKl/xNAXSytsnBMuo9NRXSRTXRpS8ok1Rh7U3ioKVrupC28v9Ze6 hlPw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=vU8KsqMF4jrOIHiDLFv6tiZ3pZ9gGp87uv6xrWTW7Dg=; b=eUEoOCQu03qhYy3b3OMeCAwVC0ab0MKQIahtW5Gvkb685lKTbZ13s3CoJRnXGWRNva 7hz2LdVqacrABUtcSMmDZDps27CRZ8HLoOsi8nlDlgIvTgb8oxZUI0lNmchlo0ln7hJA OGPAMxEHBY7fr5xQmQQgSByb1esT7wZUKpkFMGPS35QmQDGWSjSEsVhVzWr4Yagk05Qy BaYbtq/D4Kqh10100mM7NlFtgTAebpAkG7ZXxm4ekrVJWBnUjy2wL8cKySSR4FWZ0qfj 5TCoaagzVe6VaZmC/AraKAMrUK1eiUQ1UBFWTgO0Qrs0FPv17Y7LD4ZWPuLG8N0RIU5C QKzQ== X-Gm-Message-State: AGi0PuY1KCOJLVuzfJHs5e6zZ8O5KDfWkudIBwCAWN5qJ4HGzjCqLAGV /0f2RwPI2SWshNmvxpEBfBC+jvWf X-Google-Smtp-Source: APiQypLR8MAERiceQ6iq+kxgofEinZrMd9kr/ZPoqsdrkjcyGyDkMdBb2F0lU21FjZPFtApzBkx93Q== X-Received: by 2002:a17:906:e25a:: with SMTP id gq26mr438073ejb.352.1586192398979; Mon, 06 Apr 2020 09:59:58 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id v12sm2511072edw.51.2020.04.06.09.59.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 09:59:58 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:43 +0000 Subject: [PATCH v4 03/15] bloom.c: introduce core Bloom filter constructs MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Introduce the constructs for Bloom filters, Bloom filter keys and Bloom filter settings. For details on what Bloom filters are and how they work, refer to Dr. Derrick Stolee's blog post [1]. It provides a concise explanation of the adoption of Bloom filters as described in [2] and [3]. Implementation specifics: 1. We currently use 7 and 10 for the number of hashes and the size of each entry respectively. They served as great starting values, the mathematical details behind this choice are described in [1] and [4]. The implementation, while not completely open to it at the moment, is flexible enough to allow for tweaking these settings in the future. Note: The performance gains we have observed with these values are significant enough that we did not need to tweak these settings. The performance numbers are included in the cover letter of this series and in the commit message of the subsequent commit where we use Bloom filters to speed up `git log -- path`. 2. As described in [1] and [3], we do not need 7 independent hashing functions. We use the Murmur3 hashing scheme, seed it twice and then combine those to procure an arbitrary number of hash values. 3. The filters will be sized according to the number of changes in each commit, in multiples of 8 bit words. [1] Derrick Stolee "Supercharging the Git Commit Graph IV: Bloom Filters" https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/ [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese "An Improved Construction for Counting Bloom Filters" http://theory.stanford.edu/~rinap/papers/esa2006b.pdf https://doi.org/10.1007/11841036_61 [3] Peter C. Dillinger and Panagiotis Manolios "Bloom Filters in Probabilistic Verification" http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf https://doi.org/10.1007/978-3-540-30494-4_26 [4] Thomas Mueller Graf, Daniel Lemire "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters" https://arxiv.org/abs/1912.08258 Helped-by: Derrick Stolee Reviewed-by: Jakub Narębski Signed-off-by: Garima Singh --- bloom.c | 38 +++++++++++++++++++++++++- bloom.h | 63 +++++++++++++++++++++++++++++++++++++++++++ t/helper/test-bloom.c | 48 +++++++++++++++++++++++++++++++++ t/t0095-bloom.sh | 40 +++++++++++++++++++++++++++ 4 files changed, 188 insertions(+), 1 deletion(-) diff --git a/bloom.c b/bloom.c index 40e87632aeb..888b67f1ea6 100644 --- a/bloom.c +++ b/bloom.c @@ -8,6 +8,11 @@ static uint32_t rotate_left(uint32_t value, int32_t count) return ((value << count) | (value >> ((-count) & mask))); } +static inline unsigned char get_bitmask(uint32_t pos) +{ + return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1)); +} + /* * Calculate the murmur3 32-bit hash value for the given data * using the given seed. @@ -70,4 +75,35 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) seed ^= (seed >> 16); return seed; -} \ No newline at end of file +} + +void fill_bloom_key(const char *data, + size_t len, + struct bloom_key *key, + const struct bloom_filter_settings *settings) +{ + int i; + const uint32_t seed0 = 0x293ae76f; + const uint32_t seed1 = 0x7e646e2c; + const uint32_t hash0 = murmur3_seeded(seed0, data, len); + const uint32_t hash1 = murmur3_seeded(seed1, data, len); + + key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t)); + for (i = 0; i < settings->num_hashes; i++) + key->hashes[i] = hash0 + i * hash1; +} + +void add_key_to_filter(const struct bloom_key *key, + struct bloom_filter *filter, + const struct bloom_filter_settings *settings) +{ + int i; + uint64_t mod = filter->len * BITS_PER_WORD; + + for (i = 0; i < settings->num_hashes; i++) { + uint64_t hash_mod = key->hashes[i] % mod; + uint64_t block_pos = hash_mod / BITS_PER_WORD; + + filter->data[block_pos] |= get_bitmask(hash_mod); + } +} diff --git a/bloom.h b/bloom.h index d0fcc5f0aa6..b9ce422ca2d 100644 --- a/bloom.h +++ b/bloom.h @@ -1,6 +1,60 @@ #ifndef BLOOM_H #define BLOOM_H +struct bloom_filter_settings { + /* + * The version of the hashing technique being used. + * We currently only support version = 1 which is + * the seeded murmur3 hashing technique implemented + * in bloom.c. + */ + uint32_t hash_version; + + /* + * The number of times a path is hashed, i.e. the + * number of bit positions tht cumulatively + * determine whether a path is present in the + * Bloom filter. + */ + uint32_t num_hashes; + + /* + * The minimum number of bits per entry in the Bloom + * filter. If the filter contains 'n' entries, then + * filter size is the minimum number of 8-bit words + * that contain n*b bits. + */ + uint32_t bits_per_entry; +}; + +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 } +#define BITS_PER_WORD 8 + +/* + * A bloom_filter struct represents a data segment to + * use when testing hash values. The 'len' member + * dictates how many entries are stored in + * 'data'. + */ +struct bloom_filter { + unsigned char *data; + size_t len; +}; + +/* + * A bloom_key represents the k hash values for a + * given string. These can be precomputed and + * stored in a bloom_key for re-use when testing + * against a bloom_filter. The number of hashes is + * given by the Bloom filter settings and is the same + * for all Bloom filters and keys interacting with + * the loaded version of the commit graph file and + * the Bloom data chunks. + */ +struct bloom_key { + uint32_t *hashes; +}; + /* * Calculate the murmur3 32-bit hash value for the given data * using the given seed. @@ -10,4 +64,13 @@ */ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len); +void fill_bloom_key(const char *data, + size_t len, + struct bloom_key *key, + const struct bloom_filter_settings *settings); + +void add_key_to_filter(const struct bloom_key *key, + struct bloom_filter *filter, + const struct bloom_filter_settings *settings); + #endif \ No newline at end of file diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c index 60ee2043689..20460cde775 100644 --- a/t/helper/test-bloom.c +++ b/t/helper/test-bloom.c @@ -2,6 +2,36 @@ #include "bloom.h" #include "test-tool.h" +struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; + +static void add_string_to_filter(const char *data, struct bloom_filter *filter) { + struct bloom_key key; + int i; + + fill_bloom_key(data, strlen(data), &key, &settings); + printf("Hashes:"); + for (i = 0; i < settings.num_hashes; i++){ + printf("0x%08x|", key.hashes[i]); + } + printf("\n"); + add_key_to_filter(&key, filter, &settings); +} + +static void print_bloom_filter(struct bloom_filter *filter) { + int i; + + if (!filter) { + printf("No filter.\n"); + return; + } + printf("Filter_Length:%d\n", (int)filter->len); + printf("Filter_Data:"); + for (i = 0; i < filter->len; i++){ + printf("%02x|", filter->data[i]); + } + printf("\n"); +} + int cmd__bloom(int argc, const char **argv) { if (!strcmp(argv[1], "get_murmur3")) { @@ -9,5 +39,23 @@ int cmd__bloom(int argc, const char **argv) printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); } + if (!strcmp(argv[1], "generate_filter")) { + struct bloom_filter filter; + int i = 2; + filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD; + filter.data = xcalloc(filter.len, sizeof(unsigned char)); + + if (!argv[2]){ + die("at least one input string expected"); + } + + while (argv[i]) { + add_string_to_filter(argv[i], &filter); + i++; + } + + print_bloom_filter(&filter); + } + return 0; } \ No newline at end of file diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh index 2dad8c4a94e..36a086c7c60 100755 --- a/t/t0095-bloom.sh +++ b/t/t0095-bloom.sh @@ -27,4 +27,44 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' ' test_cmp expect actual ' +test_expect_success 'compute bloom key for empty string' ' + cat >expect <<-\EOF && + Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004| + Filter_Length:2 + Filter_Data:11|11| + EOF + test-tool bloom generate_filter "" >actual && + test_cmp expect actual +' + +test_expect_success 'compute bloom key for whitespace' ' + cat >expect <<-\EOF && + Hashes:0xf178874c|0x5f3d6eb6|0xcd025620|0x3ac73d8a|0xa88c24f4|0x16510c5e|0x8415f3c8| + Filter_Length:2 + Filter_Data:51|55| + EOF + test-tool bloom generate_filter " " >actual && + test_cmp expect actual +' + +test_expect_success 'compute bloom key for test string 1' ' + cat >expect <<-\EOF && + Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d| + Filter_Length:2 + Filter_Data:92|6c| + EOF + test-tool bloom generate_filter "Hello world!" >actual && + test_cmp expect actual +' + +test_expect_success 'compute bloom key for test string 2' ' + cat >expect <<-\EOF && + Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585| + Filter_Length:2 + Filter_Data:a5|4a| + EOF + test-tool bloom generate_filter "file.txt" >actual && + test_cmp expect actual +' + test_done \ No newline at end of file From patchwork Mon Apr 6 16:59:44 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475909 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 85413912 for ; Mon, 6 Apr 2020 17:00:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 58E2E233EB for ; Mon, 6 Apr 2020 17:00:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="J3agDrgj" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729740AbgDFRAE (ORCPT ); Mon, 6 Apr 2020 13:00:04 -0400 Received: from mail-ed1-f65.google.com ([209.85.208.65]:45127 "EHLO mail-ed1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729705AbgDFRAD (ORCPT ); Mon, 6 Apr 2020 13:00:03 -0400 Received: by mail-ed1-f65.google.com with SMTP id m12so322587edl.12 for ; Mon, 06 Apr 2020 10:00:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=hB+IdW0dP0vYOihG25fJnUEDSo2H4LGMF60eszlgdzA=; b=J3agDrgjZD7tJJCBpu16uq9XM/WeE6GZtMIjxuuyeu9+KwbAN7H0eFgskFuLUzRE10 6+frQ8hsPSIn3dS2UZzxKXRPiDqKgKsOFVmOiYZeQx7dr1VtTgh+JXNrP7ANKoFjcX+b 6D3t1my1WnzoEj3MeRziGJ/u8fLQZ9ervKg5R9UWlD4ypWu+oPammIYlsmEkbiAZXDzA J4UOeRxyY4UfiYLNCxx3phAJcImn6m29P/cf5IjfgL3cQBdRA+S96uCZ5XycZ7RbRiw1 XU+5PPTdAAnVcIqpnOt7/FWFkH+XxWvWO+WOS6xxQKrZoncQFNnAfiRHb9asZLaGN4SD oouQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=hB+IdW0dP0vYOihG25fJnUEDSo2H4LGMF60eszlgdzA=; b=qGstLe+j2i5eyh+aRoYL/aLiis1A/d5+Od5ezt+qT6zvGZwogHZJ5RvaMU80UIepU7 Y0DFshHq8/gEf+niwM6i9bDFr7ABrK5LzaeU89dnyH5Ad+srLdMLfnLGvawpim6yccsq tRsQjPJ1NNXkWrQ0nNHfr09O01x4hyVpk8ufbRQuB8WGlHCpiWxr2q0zQrBOqW3jVjWh GqQIG675KXRYM3NkgDsFs8o00xWofverYsg+seg5QyTs/k/YzkPJPYJCtxT+8CrIepwf 0wpBYwABxSwKw+554gMkkfNJnLOw3+HNvE0Hh3TnXf30c4C0cOlcQim6ly0Tkkuriumy 3J5A== X-Gm-Message-State: AGi0PuaNq6aBAyStKV+vleCQKxeLpB7unV0gNZk3fXzU3f4sA2NdScm+ TCbVkUrABaCBZ7vFJo+/dHkceT9M X-Google-Smtp-Source: APiQypLbUhiHTLWYfYeTxVhvqEIaNGoEpkJVRI+Q5cDD7DlwvxsMAtmTSuMNuQmqJqCF6Uax4ropTA== X-Received: by 2002:a17:906:d1cf:: with SMTP id bs15mr453829ejb.280.1586192399715; Mon, 06 Apr 2020 09:59:59 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id dj1sm2544700edb.70.2020.04.06.09.59.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 09:59:59 -0700 (PDT) Message-Id: <8304c2975207ee847c6709abd71efee918fc4142.1586192395.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:44 +0000 Subject: [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths. MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Add the core implementation for computing Bloom filters for the paths changed between a commit and it's first parent. We fill the Bloom filters as (const char *data, int len) pairs as `struct bloom_filters" within a commit slab. Filters for commits with no changes and more than 512 changes, is represented with a filter of length zero. There is no gain in distinguishing between a computed filter of length zero for a commit with no changes, and an uncomputed filter for new commits or for commits with more than 512 changes. The effect on `git log -- path` is the same in both cases. We will fall back to the normal diffing algorithm when we can't benefit from the existence of Bloom filters. Helped-by: Jeff King Helped-by: Derrick Stolee Reviewed-by: Jakub Narębski Signed-off-by: Garima Singh --- bloom.c | 97 +++++++++++++++++++++++++++++++++++++++++++ bloom.h | 8 ++++ t/helper/test-bloom.c | 20 +++++++++ t/t0095-bloom.sh | 47 +++++++++++++++++++++ 4 files changed, 172 insertions(+) diff --git a/bloom.c b/bloom.c index 888b67f1ea6..881a9841ede 100644 --- a/bloom.c +++ b/bloom.c @@ -1,5 +1,18 @@ #include "git-compat-util.h" #include "bloom.h" +#include "diff.h" +#include "diffcore.h" +#include "revision.h" +#include "hashmap.h" + +define_commit_slab(bloom_filter_slab, struct bloom_filter); + +struct bloom_filter_slab bloom_filters; + +struct pathmap_hash_entry { + struct hashmap_entry entry; + const char path[FLEX_ARRAY]; +}; static uint32_t rotate_left(uint32_t value, int32_t count) { @@ -107,3 +120,87 @@ void add_key_to_filter(const struct bloom_key *key, filter->data[block_pos] |= get_bitmask(hash_mod); } } + +void init_bloom_filters(void) +{ + init_bloom_filter_slab(&bloom_filters); +} + +struct bloom_filter *get_bloom_filter(struct repository *r, + struct commit *c) +{ + struct bloom_filter *filter; + struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; + int i; + struct diff_options diffopt; + + if (bloom_filters.slab_size == 0) + return NULL; + + filter = bloom_filter_slab_at(&bloom_filters, c); + + repo_diff_setup(r, &diffopt); + diffopt.flags.recursive = 1; + diff_setup_done(&diffopt); + + if (c->parents) + diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt); + else + diff_tree_oid(NULL, &c->object.oid, "", &diffopt); + diffcore_std(&diffopt); + + if (diff_queued_diff.nr <= 512) { + struct hashmap pathmap; + struct pathmap_hash_entry *e; + struct hashmap_iter iter; + hashmap_init(&pathmap, NULL, NULL, 0); + + for (i = 0; i < diff_queued_diff.nr; i++) { + const char *path = diff_queued_diff.queue[i]->two->path; + + /* + * Add each leading directory of the changed file, i.e. for + * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so + * the Bloom filter could be used to speed up commands like + * 'git log dir/subdir', too. + * + * Note that directories are added without the trailing '/'. + */ + do { + char *last_slash = strrchr(path, '/'); + + FLEX_ALLOC_STR(e, path, path); + hashmap_entry_init(&e->entry, strhash(path)); + hashmap_add(&pathmap, &e->entry); + + if (!last_slash) + last_slash = (char*)path; + *last_slash = '\0'; + + } while (*path); + + diff_free_filepair(diff_queued_diff.queue[i]); + } + + filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD; + filter->data = xcalloc(filter->len, sizeof(unsigned char)); + + hashmap_for_each_entry(&pathmap, &iter, e, entry) { + struct bloom_key key; + fill_bloom_key(e->path, strlen(e->path), &key, &settings); + add_key_to_filter(&key, filter, &settings); + } + + hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry); + } else { + for (i = 0; i < diff_queued_diff.nr; i++) + diff_free_filepair(diff_queued_diff.queue[i]); + filter->data = NULL; + filter->len = 0; + } + + free(diff_queued_diff.queue); + DIFF_QUEUE_CLEAR(&diff_queued_diff); + + return filter; +} diff --git a/bloom.h b/bloom.h index b9ce422ca2d..85ab8e9423d 100644 --- a/bloom.h +++ b/bloom.h @@ -1,6 +1,9 @@ #ifndef BLOOM_H #define BLOOM_H +struct commit; +struct repository; + struct bloom_filter_settings { /* * The version of the hashing technique being used. @@ -73,4 +76,9 @@ void add_key_to_filter(const struct bloom_key *key, struct bloom_filter *filter, const struct bloom_filter_settings *settings); +void init_bloom_filters(void); + +struct bloom_filter *get_bloom_filter(struct repository *r, + struct commit *c); + #endif \ No newline at end of file diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c index 20460cde775..f18d1b722e1 100644 --- a/t/helper/test-bloom.c +++ b/t/helper/test-bloom.c @@ -1,6 +1,7 @@ #include "git-compat-util.h" #include "bloom.h" #include "test-tool.h" +#include "commit.h" struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; @@ -32,6 +33,16 @@ static void print_bloom_filter(struct bloom_filter *filter) { printf("\n"); } +static void get_bloom_filter_for_commit(const struct object_id *commit_oid) +{ + struct commit *c; + struct bloom_filter *filter; + setup_git_directory(); + c = lookup_commit(the_repository, commit_oid); + filter = get_bloom_filter(the_repository, c); + print_bloom_filter(filter); +} + int cmd__bloom(int argc, const char **argv) { if (!strcmp(argv[1], "get_murmur3")) { @@ -57,5 +68,14 @@ int cmd__bloom(int argc, const char **argv) print_bloom_filter(&filter); } + if (!strcmp(argv[1], "get_filter_for_commit")) { + struct object_id oid; + const char *end; + if (parse_oid_hex(argv[2], &oid, &end)) + die("cannot parse oid '%s'", argv[2]); + init_bloom_filters(); + get_bloom_filter_for_commit(&oid); + } + return 0; } \ No newline at end of file diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh index 36a086c7c60..8f9eef116dc 100755 --- a/t/t0095-bloom.sh +++ b/t/t0095-bloom.sh @@ -67,4 +67,51 @@ test_expect_success 'compute bloom key for test string 2' ' test_cmp expect actual ' +test_expect_success 'get bloom filters for commit with no changes' ' + git init && + git commit --allow-empty -m "c0" && + cat >expect <<-\EOF && + Filter_Length:0 + Filter_Data: + EOF + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual && + test_cmp expect actual +' + +test_expect_success 'get bloom filter for commit with 10 changes' ' + rm actual && + rm expect && + mkdir smallDir && + for i in $(test_seq 0 9) + do + echo $i >smallDir/$i + done && + git add smallDir && + git commit -m "commit with 10 changes" && + cat >expect <<-\EOF && + Filter_Length:25 + Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23| + EOF + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual && + test_cmp expect actual +' + +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' ' + rm actual && + rm expect && + mkdir bigDir && + for i in $(test_seq 0 512) + do + echo $i >bigDir/$i + done && + git add bigDir && + git commit -m "commit with 513 changes" && + cat >expect <<-\EOF && + Filter_Length:0 + Filter_Data: + EOF + test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual && + test_cmp expect actual +' + test_done \ No newline at end of file From patchwork Mon Apr 6 16:59:45 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475907 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 141A81892 for ; Mon, 6 Apr 2020 17:00:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D9FE521841 for ; Mon, 6 Apr 2020 17:00:04 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="VjK1yEAv" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729730AbgDFRAC (ORCPT ); Mon, 6 Apr 2020 13:00:02 -0400 Received: from mail-ed1-f67.google.com ([209.85.208.67]:38507 "EHLO mail-ed1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729574AbgDFRAC (ORCPT ); Mon, 6 Apr 2020 13:00:02 -0400 Received: by mail-ed1-f67.google.com with SMTP id e5so378889edq.5 for ; Mon, 06 Apr 2020 10:00:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=b09N7vSxma0z7ZkQvyBSUo7VU222X1/gp1zj6yECZGU=; b=VjK1yEAvX+oClxI3FeCtGk6J8/gC1yLS88hqGMnUi+bWCn6p1hmbXfUwi4cqveTLz8 kKJGQ2d6dmcr4WHIyIGs/cAE8BGCqI3O20x7gpIpJr8sxjhSWKJ2wPBnfDbQx2BRrJQk V7a+cF9sHJ8zsQQcEgLRjhXUBi4yE+nuTl7GanMERa+Je0seu4AaQ81M05eFNzdMXxG2 rhh9UV3hRlWPB9BHg8Xpiupny0yN0vqRIygDNuPRZvjzmdCENgFkbNzW72SM3Osrqfen fpd/F8HZ4KrHQwiYLsC7I3hMAkoI3p+Lwy4MuAF3uscwBM5QV8tSBCFXaxgGhlzX0lnD sFqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=b09N7vSxma0z7ZkQvyBSUo7VU222X1/gp1zj6yECZGU=; b=ZxJj1xV47Pq1v2YWM/3w7nAjnxKSajfTg5nUKPEQIZQO2BaALzWmtDt57Gz/FfWvjn OMXXQng2y1XMMpo7zttNdmuG7XgKXgvG5cez0FMcuMtFxaMH5FrMVE63Y6CazsbELhhY eRdxB6xvK8Z6B2CjykTCEFC2g0r/zMgs3IySU03td+2HcwYNTnv6a+TjLFO7Lawx9uXM EPOexgb1iqNXRg/AjBbb0e7IredlBcFPgrM4EDVvwDVlI9BAWUuGbxPjAvr7VNATFhrJ AGBkqWLPtDDZwqUTlgpxqPv9MdfS3FEWYFS9i+wwBpy1eRbskrdXiP/4DS632e4uIbrr tKpg== X-Gm-Message-State: AGi0PubmEdkgiyiEtSy2DiuQK/phd8xe7i1+b3gMbeCz+jTOoBpiaaVO Gc523GbKvLEtGkP8L3sy+DCI0iai X-Google-Smtp-Source: APiQypIgNP8ARY7uKc8gs+zM9v9PNBUrckFKvSpm4UjFsyoesEflBloZNANxY0x5P3fBJp4XM8u5ug== X-Received: by 2002:a50:ab1e:: with SMTP id s30mr20816734edc.336.1586192400378; Mon, 06 Apr 2020 10:00:00 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id p20sm2602862edw.31.2020.04.06.09.59.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 09:59:59 -0700 (PDT) Message-Id: <2d4c0b2da38632424c8bd31ccb2037e0676c3c74.1586192395.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:45 +0000 Subject: [PATCH v4 05/15] diff: halt tree-diff early after max_changes Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Derrick Stolee Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee When computing the changed-paths bloom filters for the commit-graph, we limit the size of the filter by restricting the number of paths in the diff. Instead of computing a large diff and then ignoring the result, it is better to halt the diff computation early. Create a new "max_changes" option in struct diff_options. If non-zero, then halt the diff computation after discovering strictly more changed paths. This includes paths corresponding to trees that change. Use this max_changes option in the bloom filter calculations. This reduces the time taken to compute the filters for the Linux kernel repo from 2m50s to 2m35s. On a large internal repository with ~500 commits that perform tree-wide changes, the time reduced from 6m15s to 3m48s. Signed-off-by: Derrick Stolee Signed-off-by: Garima Singh --- bloom.c | 4 +++- diff.h | 5 +++++ tree-diff.c | 6 ++++++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/bloom.c b/bloom.c index 881a9841ede..a16eee92331 100644 --- a/bloom.c +++ b/bloom.c @@ -133,6 +133,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; int i; struct diff_options diffopt; + int max_changes = 512; if (bloom_filters.slab_size == 0) return NULL; @@ -141,6 +142,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, repo_diff_setup(r, &diffopt); diffopt.flags.recursive = 1; + diffopt.max_changes = max_changes; diff_setup_done(&diffopt); if (c->parents) @@ -149,7 +151,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, diff_tree_oid(NULL, &c->object.oid, "", &diffopt); diffcore_std(&diffopt); - if (diff_queued_diff.nr <= 512) { + if (diff_queued_diff.nr <= max_changes) { struct hashmap pathmap; struct pathmap_hash_entry *e; struct hashmap_iter iter; diff --git a/diff.h b/diff.h index 6febe7e3656..9443dc1b003 100644 --- a/diff.h +++ b/diff.h @@ -285,6 +285,11 @@ struct diff_options { /* Number of hexdigits to abbreviate raw format output to. */ int abbrev; + /* If non-zero, then stop computing after this many changes. */ + int max_changes; + /* For internal use only. */ + int num_changes; + int ita_invisible_in_index; /* white-space error highlighting */ #define WSEH_NEW (1<<12) diff --git a/tree-diff.c b/tree-diff.c index 33ded7f8b3e..f3d303c6e54 100644 --- a/tree-diff.c +++ b/tree-diff.c @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths( if (diff_can_quit_early(opt)) break; + if (opt->max_changes && opt->num_changes > opt->max_changes) + break; + if (opt->pathspec.nr) { skip_uninteresting(&t, base, opt); for (i = 0; i < nparent; i++) @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths( /* t↓ */ update_tree_entry(&t); + opt->num_changes++; } /* t > p[imin] */ @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths( skip_emit_tp: /* ∀ pi=p[imin] pi↓ */ update_tp_entries(tp, nparent); + opt->num_changes++; } } @@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths( const struct object_id **parents_oid, int nparent, struct strbuf *base, struct diff_options *opt) { + opt->num_changes = 0; p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt); /* From patchwork Mon Apr 6 16:59:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475911 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A8248912 for ; Mon, 6 Apr 2020 17:00:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 851D620780 for ; Mon, 6 Apr 2020 17:00:06 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="gOmHIflD" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729745AbgDFRAF (ORCPT ); Mon, 6 Apr 2020 13:00:05 -0400 Received: from mail-ed1-f68.google.com ([209.85.208.68]:40504 "EHLO mail-ed1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729731AbgDFRAE (ORCPT ); Mon, 6 Apr 2020 13:00:04 -0400 Received: by mail-ed1-f68.google.com with SMTP id w26so360261edu.7 for ; Mon, 06 Apr 2020 10:00:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=KAmO1RySrgApRaoIdXSMKQzey0ygDFzHa23GpHnOhvk=; b=gOmHIflDLr3c58VBq0RTQet6FCkRX1aO6cf76WJkf/jF4nrPQ5X+EoYAwmMCfoXRSX oFxt1ZTtwm731hjU8rQio5vkjj3t6Kl53ZjWO6u+AiteZOFxuMsCjCrViBPsjthDHUKr LVKEBRZiEzXRNnxtIc6ZOue8lpfB/Z+3I3A5/wizL40Scrh0ywA79DK2+th37dB/BicO 5jPkbDEn8a0GRG/NGirfdWRECs6cV508Ng2au3l1RdGcZQgqpVcGfvLQAbiYcgIxWimB /kfSJwS3NazMrAjlwK6fftlzcSDqRF7pAaGSsZ+zRmfg4koWXQRI076LAChaJZz7uuzf e6Yw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=KAmO1RySrgApRaoIdXSMKQzey0ygDFzHa23GpHnOhvk=; b=Q5AdVsOWUxtmHeeRCp9ooab8avGn9BLiMDFBYY4F/niILacTermLjDECXb4tSlIG5x 6mEDqA1hwnC4SwBE8+DhJM+PBno/vbszs/O4g162Yx62sj80ivneR6jJIn+ezpN9H5dY Szub181WYeB+P+voTGGPR879H0Xg1jqAv0i74TqVApFRZ8Ohe+nCdX+jw7kRStrh6/U5 5159vML1/VcMhCLNaDP3/fDNATNZlMdozQC5cQKsAG6fJSq6e2L7APcCHku3qXdTGXPY q4/cASyl5bsx8BxWAh/rrEkvVrGvnnlTSNzsmCkXgWmLtcLVvwd92rk8k4ITa5N3WTWF YkJA== X-Gm-Message-State: AGi0PuZufd2/KbJYQFUPijpBBapEDpqp+n/vvCeXL3t9Qaw9UZXlauod tTocW07UJQWdiqOOvJioWv9xuvJW X-Google-Smtp-Source: APiQypLzJwpNktdudPIunErpwM+DLDcY3IVRxQVkj5D/3F4F7XBzeL9GdXOcvA6YY11vLWf+zXPgPQ== X-Received: by 2002:a17:906:3502:: with SMTP id r2mr416807eja.67.1586192401265; Mon, 06 Apr 2020 10:00:01 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id s26sm1283701edx.9.2020.04.06.10.00.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:00 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:46 +0000 Subject: [PATCH v4 06/15] commit-graph: compute Bloom filters for changed paths Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute Bloom filters for the paths that changed between a commit and it's first parent, for each commit in the commit-graph. This computation is done on a commit-by-commit basis. We will write these Bloom filters to the commit-graph file, to store this data on disk, in the next change in this series. Helped-by: Derrick Stolee Signed-off-by: Garima Singh --- commit-graph.c | 32 +++++++++++++++++++++++++++++++- commit-graph.h | 3 ++- 2 files changed, 33 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index e4f1a5b2f1a..862a00d67ed 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -16,6 +16,7 @@ #include "hashmap.h" #include "replace-object.h" #include "progress.h" +#include "bloom.h" #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ @@ -789,9 +790,11 @@ struct write_commit_graph_context { unsigned append:1, report_progress:1, split:1, - check_oids:1; + check_oids:1, + changed_paths:1; const struct split_commit_graph_opts *split_opts; + size_t total_bloom_filter_data_size; }; static void write_graph_chunk_fanout(struct hashfile *f, @@ -1134,6 +1137,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) stop_progress(&ctx->progress); } +static void compute_bloom_filters(struct write_commit_graph_context *ctx) +{ + int i; + struct progress *progress = NULL; + + init_bloom_filters(); + + if (ctx->report_progress) + progress = start_delayed_progress( + _("Computing commit changed paths Bloom filters"), + ctx->commits.nr); + + for (i = 0; i < ctx->commits.nr; i++) { + struct commit *c = ctx->commits.list[i]; + struct bloom_filter *filter = get_bloom_filter(ctx->r, c); + ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len; + display_progress(progress, i + 1); + } + + stop_progress(&progress); +} + static int add_ref_to_list(const char *refname, const struct object_id *oid, int flags, void *cb_data) @@ -1776,6 +1801,8 @@ int write_commit_graph(struct object_directory *odb, ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0; ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0; ctx->split_opts = split_opts; + ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0; + ctx->total_bloom_filter_data_size = 0; if (ctx->split) { struct commit_graph *g; @@ -1870,6 +1897,9 @@ int write_commit_graph(struct object_directory *odb, compute_generation_numbers(ctx); + if (ctx->changed_paths) + compute_bloom_filters(ctx); + res = write_commit_graph_file(ctx); if (ctx->split) diff --git a/commit-graph.h b/commit-graph.h index e87a6f63600..86be81219da 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -79,7 +79,8 @@ enum commit_graph_write_flags { COMMIT_GRAPH_WRITE_PROGRESS = (1 << 1), COMMIT_GRAPH_WRITE_SPLIT = (1 << 2), /* Make sure that each OID in the input is a valid commit OID. */ - COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3) + COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3), + COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4), }; struct split_commit_graph_opts { From patchwork Mon Apr 6 16:59:47 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475915 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BF68F1392 for ; Mon, 6 Apr 2020 17:00:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 9C6DA21841 for ; Mon, 6 Apr 2020 17:00:08 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ZW2IVRpc" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729748AbgDFRAH (ORCPT ); Mon, 6 Apr 2020 13:00:07 -0400 Received: from mail-ed1-f67.google.com ([209.85.208.67]:34047 "EHLO mail-ed1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729534AbgDFRAG (ORCPT ); Mon, 6 Apr 2020 13:00:06 -0400 Received: by mail-ed1-f67.google.com with SMTP id o1so412889edv.1 for ; Mon, 06 Apr 2020 10:00:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=UFasefMTFJlBZT0H7l3YT+5vQ3mTXZeAfKd7dCO5Krw=; b=ZW2IVRpcpynnmdWd9MwShV+IGVlix1i0OmwnruhqVydq4sGWjbOdU4QZFCiK46DepB sCvGTuAfSJtEd6hX9BXqN8JUde33W6XIcQtZ8feRsAjaGIPHYg3znsPUQubJn5q8cjbV GnFKTOEbKNyX8CNeQSV5M5X//+q37XEO7VCGG2/g73yK0GacUxkP2pvrnMK5q+jRfjUd Yw3KOgWrB4Pa0+qa92O9kNj4e2Odxx+pLrNZfplsuV42gHNMX4yRBqU5qFjLYIlOPX0Q dBpPz+KJe9akrpIXuWnHeQEX8bMYSbuiWTIHibJQFuYzyCdseFNKTrQjp41oHuUGRCjb /4bw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=UFasefMTFJlBZT0H7l3YT+5vQ3mTXZeAfKd7dCO5Krw=; b=iuUru8iYsuRDkdvlm+jciFzwsxKZA76kg7a4V7z53SJMUu49lFvwYFMAIpxeiNxINT VSAJK5XktvA/54doh8QyCNSjVQCNZyxdyfoy4oEgLMRAmM3wk/hBvCj4edL1MRJhKdJ9 MkKPgE2Lyl6+n24dk6bjEB5Y8kQG5KukDJGbD7fKX08XdOLm+dWnhqGSGjyKtVWsIWj5 dtyzKU71AeUPRwVBEbootAjukKhGaieCPtwZ62WoBvK7mHhnxmWdROIgkpGvUxnTXm0o aYRUSiNs3GLlexRuLt6W4rgXS7D4hn2b8fPpLUG2JwISt0xMrDN53Ib70TJpzqkdJDRm 5VeA== X-Gm-Message-State: AGi0PuYcbrw+O8V5qtiygi1ElGrvqdt/55wOlqCz/8T6ZOHAfTtiUgC3 vjTaz1joXR5QgY771NL+xwZ9D9jC X-Google-Smtp-Source: APiQypKetCJooeemvfUfapTBPls4zQR7iXPoJNSXBYeQ8kyqIDIG1JjmsakDbjkCm+/AjhrfNfpvFw== X-Received: by 2002:aa7:d88f:: with SMTP id u15mr9444826edq.87.1586192402142; Mon, 06 Apr 2020 10:00:02 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id s7sm2988773ejx.28.2020.04.06.10.00.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:01 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Jeff King via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:47 +0000 Subject: [PATCH v4 07/15] commit-graph: examine changed-path objects in pack order Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Jeff King Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King Looking at the diff of commit objects in pack order is much faster than in sha1 order, as it gives locality to the access of tree deltas (whereas sha1 order is effectively random). Unfortunately the commit-graph code sorts the commits (several times, sometimes as an oid and sometimes a pointer-to-commit), and we ultimately traverse in sha1 order. Instead, let's remember the position at which we see each commit, and traverse in that order when looking at bloom filters. This drops my time for "git commit-graph write --changed-paths" in linux.git from ~4 minutes to ~1.5 minutes. Probably the "--reachable" code path would want something similar. Or alternatively, we could use a different data structure (either a hash, or maybe even just a bit in "struct commit") to keep track of which oids we've seen, etc instead of sorting. And then we could keep the original order. Signed-off-by: Jeff King Signed-off-by: Garima Singh --- commit-graph.c | 38 +++++++++++++++++++++++++++++++++++--- 1 file changed, 35 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 862a00d67ed..31b06f878ce 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -17,6 +17,7 @@ #include "replace-object.h" #include "progress.h" #include "bloom.h" +#include "commit-slab.h" #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ @@ -46,9 +47,32 @@ /* Remember to update object flag allocation in object.h */ #define REACHABLE (1u<<15) -char *get_commit_graph_filename(struct object_directory *odb) +/* Keep track of the order in which commits are added to our list. */ +define_commit_slab(commit_pos, int); +static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos); + +static void set_commit_pos(struct repository *r, const struct object_id *oid) +{ + static int32_t max_pos; + struct commit *commit = lookup_commit(r, oid); + + if (!commit) + return; /* should never happen, but be lenient */ + + *commit_pos_at(&commit_pos, commit) = max_pos++; +} + +static int commit_pos_cmp(const void *va, const void *vb) { - return xstrfmt("%s/info/commit-graph", odb->path); + const struct commit *a = *(const struct commit **)va; + const struct commit *b = *(const struct commit **)vb; + return commit_pos_at(&commit_pos, a) - + commit_pos_at(&commit_pos, b); +} + +char *get_commit_graph_filename(struct object_directory *obj_dir) +{ + return xstrfmt("%s/info/commit-graph", obj_dir->path); } static char *get_split_graph_filename(struct object_directory *odb, @@ -1021,6 +1045,8 @@ static int add_packed_commits(const struct object_id *oid, oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid); ctx->oids.nr++; + set_commit_pos(ctx->r, oid); + return 0; } @@ -1141,6 +1167,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx) { int i; struct progress *progress = NULL; + struct commit **sorted_commits; init_bloom_filters(); @@ -1149,13 +1176,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx) _("Computing commit changed paths Bloom filters"), ctx->commits.nr); + ALLOC_ARRAY(sorted_commits, ctx->commits.nr); + COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr); + QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp); + for (i = 0; i < ctx->commits.nr; i++) { - struct commit *c = ctx->commits.list[i]; + struct commit *c = sorted_commits[i]; struct bloom_filter *filter = get_bloom_filter(ctx->r, c); ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len; display_progress(progress, i + 1); } + free(sorted_commits); stop_progress(&progress); } From patchwork Mon Apr 6 16:59:48 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475919 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5CD1B1392 for ; Mon, 6 Apr 2020 17:00:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3787021841 for ; Mon, 6 Apr 2020 17:00:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="u2CGzb64" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729757AbgDFRAJ (ORCPT ); Mon, 6 Apr 2020 13:00:09 -0400 Received: from mail-ed1-f65.google.com ([209.85.208.65]:42306 "EHLO mail-ed1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729646AbgDFRAG (ORCPT ); Mon, 6 Apr 2020 13:00:06 -0400 Received: by mail-ed1-f65.google.com with SMTP id cw6so346008edb.9 for ; Mon, 06 Apr 2020 10:00:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=lKRmXl/SCDI8E+Ga3COYKCQR4IKS00g3pKKk/jGMaB4=; b=u2CGzb64qEPOgS75JqIJowWLwPr7cp0uUE0B2q3GHmjRgMdVec7E0dlU3xPMpkqh9h 4duRIoY3PPFijbpTritfDyG4sSPwO16Pfrq40jjQjwQH6DwcL0bNJ3vtp7yKu505q2lB eo0JdAgUDq+wnrbJ6oevG7uI6qt15q4kYJWlig7OVBGM3hIfVgJjdlTCAO04SICOUf0n kzKVqZ1JsF0brujnqiJx0K+PRZuIyqJzIuYqSzwVLUFFmneFydE10g31alh8pWckT26J brSxrwlX/E7eABHhHOBkfnxHZU76k/USLC7LAih2BQ9eJtOF76fBI8d9y2JyINazuG/w vruA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=lKRmXl/SCDI8E+Ga3COYKCQR4IKS00g3pKKk/jGMaB4=; b=bbf5T6b+wxnZbodrtrmLf8nWUf+Tc7L145T5EoIKvaZjboxLyND7TRJzGbp1GdwBc6 Tz/oBnSh1GOYBfx8pWtS+fpmMVr+O6czlqiP41VLPuuMxFN+t9tHwUlaIpW9xDa8NOc3 u8lhRXfXg+HxTR6h7atnCkiLJMAR1PAr3JNF7qmohKdsUZDyKtfp5iN6MWE6iZx86svK LaBp7a6ua5V21jgOO8BS94CbNrqs44YRX+TEHoDRDFDCbw4kIBG6Pt0hDUTKkxD4ChY4 CQkFB2WrU6P7HBKY1Czzb5t5F23/q0iW0NVitIz1w2KII6a1NpceTAo8KUmtkcYUZ2sG wi6Q== X-Gm-Message-State: AGi0PubFNgSVNy1UohteFdQkAtuYN+vAFrvjFSrtyntS7pa1PmMnLg6f 4UZSy4NL/HRK4kBnoRKq89xPp/w9 X-Google-Smtp-Source: APiQypKGtUGO9jRLMALnVqGDh+2Q6Y6AsM861Y78pc/2MCoFDmC+h84GTzKKXXgtf0u+FBX4gCM00A== X-Received: by 2002:a50:c341:: with SMTP id q1mr20797761edb.247.1586192402982; Mon, 06 Apr 2020 10:00:02 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id b11sm2461524edj.20.2020.04.06.10.00.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:02 -0700 (PDT) Message-Id: <5ed16f35fed43018ce441adfff55b85967d3918c.1586192395.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:48 +0000 Subject: [PATCH v4 08/15] commit-graph: examine commits by generation number Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh When running 'git commit-graph write --changed-paths', we sort the commits by pack-order to save time when computing the changed-paths bloom filters. This does not help when finding the commits via the '--reachable' flag. If not using pack-order, then sort by generation number before examining the diff. Commits with similar generation are more likely to have many trees in common, making the diff faster. On the Linux kernel repository, this change reduced the computation time for 'git commit-graph write --reachable --changed-paths' from 3m00s to 1m37s. Helped-by: Jeff King Signed-off-by: Derrick Stolee Signed-off-by: Garima Singh --- commit-graph.c | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 31b06f878ce..732c81fa1b2 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb) commit_pos_at(&commit_pos, b); } +static int commit_gen_cmp(const void *va, const void *vb) +{ + const struct commit *a = *(const struct commit **)va; + const struct commit *b = *(const struct commit **)vb; + + /* lower generation commits first */ + if (a->generation < b->generation) + return -1; + else if (a->generation > b->generation) + return 1; + + /* use date as a heuristic when generations are equal */ + if (a->date < b->date) + return -1; + else if (a->date > b->date) + return 1; + return 0; +} + char *get_commit_graph_filename(struct object_directory *obj_dir) { return xstrfmt("%s/info/commit-graph", obj_dir->path); @@ -815,7 +834,8 @@ struct write_commit_graph_context { report_progress:1, split:1, check_oids:1, - changed_paths:1; + changed_paths:1, + order_by_pack:1; const struct split_commit_graph_opts *split_opts; size_t total_bloom_filter_data_size; @@ -1178,7 +1198,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx) ALLOC_ARRAY(sorted_commits, ctx->commits.nr); COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr); - QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp); + + if (ctx->order_by_pack) + QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp); + else + QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp); for (i = 0; i < ctx->commits.nr; i++) { struct commit *c = sorted_commits[i]; @@ -1884,6 +1908,7 @@ int write_commit_graph(struct object_directory *odb, } if (pack_indexes) { + ctx->order_by_pack = 1; if ((res = fill_oids_from_packs(ctx, pack_indexes))) goto cleanup; } @@ -1893,8 +1918,10 @@ int write_commit_graph(struct object_directory *odb, goto cleanup; } - if (!pack_indexes && !commit_hex) + if (!pack_indexes && !commit_hex) { + ctx->order_by_pack = 1; fill_oids_from_all_packs(ctx); + } close_reachable(ctx); From patchwork Mon Apr 6 16:59:49 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475921 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BC4771871 for ; Mon, 6 Apr 2020 17:00:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 902B5224F9 for ; Mon, 6 Apr 2020 17:00:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ixCdnDU/" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729750AbgDFRAI (ORCPT ); Mon, 6 Apr 2020 13:00:08 -0400 Received: from mail-ed1-f66.google.com ([209.85.208.66]:42309 "EHLO mail-ed1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729705AbgDFRAH (ORCPT ); Mon, 6 Apr 2020 13:00:07 -0400 Received: by mail-ed1-f66.google.com with SMTP id cw6so346067edb.9 for ; Mon, 06 Apr 2020 10:00:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=RzDq0RVy9u/llzNNfsL9IhgF26sWW+4SNXyurg5TCzc=; b=ixCdnDU/PksEXMyv3ow9sIsu4O62AvAzHFr7K/FjvVHd+X72vGpqu6GzQrl5IeIcba lVPDtxrAqYRogFKNQLLVQgLI/Lx5kaYOJ0/KMcwE83DMenMXrZCRFq3VF87WPkC45pMF LzpGlDw8MW1+f9tvbfu8jumw2iGH5u52lGaDnnZlli7N3BwrFeb7jNSwZyOObVklbno8 TJsvNWrZH9/BCCJgp3/y9athMluUhOt6Q5121n6S9X6YNEafiA7BTua8P2/4lIiUbMOB cPCJ2rH2fjPJNi5HsnDhnuJnkMOCW/lGmpEDVSYukXWWTEEW7l9c4XLNnBka2LmB1LHU vdGQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=RzDq0RVy9u/llzNNfsL9IhgF26sWW+4SNXyurg5TCzc=; b=TUn9VvtnvZK7WZGi//1nnjPTHXqrWrq5U+RDroC7YEL4iBV1djpX1EfUGi/h1oAoLU ls0RUWnpAz9y3+sADUmHCTOeiptFR7DxIJFuEzQrPyVDnz8y81iI7M7aOYg9Y6XzdtDU Lh4PPR1QzW5amaUkfEY8CYHzLeYvFQ4/2339HQHa4zc2ZTz+6FbqXw1+DckfqIxcUu3c +eKTe9ZARQIJAnQ20ZLijSjICQzv0XNsv2BPFIr+PU8m4d0pwKXpeTdYIccY3hmDeFE7 HTyAleHr4zIc8u8aPjxYkSfM5a3eYFGNsoljw0axIM0IxLmH3AzlVa24tBMqIeZHiA4H +lAw== X-Gm-Message-State: AGi0Pua9Or6hp+joXYif2Zo4sqRUlBdEiBXVK/GYlEDaQ7smt925im/E zxgMdqgKB8lDVB99fk65i7OPsKi1 X-Google-Smtp-Source: APiQypLyIUg2FeOkF8J5NDUhqP2pzIWPlqrS1p2AwxH9dtTlIXWvIEdYPn18NYJzRdLbxggMaqm8Zw== X-Received: by 2002:a50:9dc4:: with SMTP id l4mr20448726edk.234.1586192403670; Mon, 06 Apr 2020 10:00:03 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id k19sm837342eds.32.2020.04.06.10.00.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:03 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:49 +0000 Subject: [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Update the technical documentation for commit-graph-format with the formats for the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the computed Bloom filters information to the commit graph file using this format. Helped-by: Derrick Stolee Signed-off-by: Garima Singh --- .../technical/commit-graph-format.txt | 30 +++++ commit-graph.c | 113 +++++++++++++++++- commit-graph.h | 5 + 3 files changed, 147 insertions(+), 1 deletion(-) diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt index a4f17441aed..de56f9f1efd 100644 --- a/Documentation/technical/commit-graph-format.txt +++ b/Documentation/technical/commit-graph-format.txt @@ -17,6 +17,9 @@ metadata, including: - The parents of the commit, stored using positional references within the graph file. +- The Bloom filter of the commit carrying the paths that were changed between + the commit and its first parent, if requested. + These positional references are stored as unsigned 32-bit integers corresponding to the array position within the list of commit OIDs. Due to some special constants we use to track parents, we can store at most @@ -93,6 +96,33 @@ CHUNK DATA: positions for the parents until reaching a value with the most-significant bit on. The other bits correspond to the position of the last parent. + Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional] + * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all + Bloom filters from commit 0 to commit i (inclusive) in lexicographic + order. The Bloom filter for the i-th commit spans from BIDX[i-1] to + BIDX[i] (plus header length), where BIDX[-1] is 0. + * The BIDX chunk is ignored if the BDAT chunk is not present. + + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional] + * It starts with header consisting of three unsigned 32-bit integers: + - Version of the hash algorithm being used. We currently only support + value 1 which corresponds to the 32-bit version of the murmur3 hash + implemented exactly as described in + https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double + hashing technique using seed values 0x293ae76f and 0x7e646e2 as + described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters + in Probabilistic Verification" + - The number of times a path is hashed and hence the number of bit positions + that cumulatively determine whether a file is present in the commit. + - The minimum number of bits 'b' per entry in the Bloom filter. If the filter + contains 'n' entries, then the filter size is the minimum number of 64-bit + words that contain n*b bits. + * The rest of the chunk is the concatenation of all the computed Bloom + filters for the commits in lexicographic order. + * Note: Commits with no changes or more than 512 changes have Bloom filters + of length zero. + * The BDAT chunk is present if and only if BIDX is present. + Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional] This list of H-byte hashes describe a set of B commit-graph files that form a commit-graph chain. The graph position for the ith commit in this diff --git a/commit-graph.c b/commit-graph.c index 732c81fa1b2..a8b6b5cca5d 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -24,8 +24,10 @@ #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */ +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */ +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */ #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */ -#define MAX_NUM_CHUNKS 5 +#define MAX_NUM_CHUNKS 7 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16) @@ -319,6 +321,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd, chunk_repeated = 1; else graph->chunk_base_graphs = data + chunk_offset; + break; + + case GRAPH_CHUNKID_BLOOMINDEXES: + if (graph->chunk_bloom_indexes) + chunk_repeated = 1; + else + graph->chunk_bloom_indexes = data + chunk_offset; + break; + + case GRAPH_CHUNKID_BLOOMDATA: + if (graph->chunk_bloom_data) + chunk_repeated = 1; + else { + uint32_t hash_version; + graph->chunk_bloom_data = data + chunk_offset; + hash_version = get_be32(data + chunk_offset); + + if (hash_version != 1) + break; + + graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings)); + graph->bloom_filter_settings->hash_version = hash_version; + graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4); + graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8); + } + break; } if (chunk_repeated) { @@ -337,6 +365,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd, last_chunk_offset = chunk_offset; } + if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) { + init_bloom_filters(); + } else { + /* We need both the bloom chunks to exist together. Else ignore the data */ + graph->chunk_bloom_indexes = NULL; + graph->chunk_bloom_data = NULL; + graph->bloom_filter_settings = NULL; + } + hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len); if (verify_commit_graph_lite(graph)) { @@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f, } } +static void write_graph_chunk_bloom_indexes(struct hashfile *f, + struct write_commit_graph_context *ctx) +{ + struct commit **list = ctx->commits.list; + struct commit **last = ctx->commits.list + ctx->commits.nr; + uint32_t cur_pos = 0; + struct progress *progress = NULL; + int i = 0; + + if (ctx->report_progress) + progress = start_delayed_progress( + _("Writing changed paths Bloom filters index"), + ctx->commits.nr); + + while (list < last) { + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + cur_pos += filter->len; + display_progress(progress, ++i); + hashwrite_be32(f, cur_pos); + list++; + } + + stop_progress(&progress); +} + +static void write_graph_chunk_bloom_data(struct hashfile *f, + struct write_commit_graph_context *ctx, + const struct bloom_filter_settings *settings) +{ + struct commit **list = ctx->commits.list; + struct commit **last = ctx->commits.list + ctx->commits.nr; + struct progress *progress = NULL; + int i = 0; + + if (ctx->report_progress) + progress = start_delayed_progress( + _("Writing changed paths Bloom filters data"), + ctx->commits.nr); + + hashwrite_be32(f, settings->hash_version); + hashwrite_be32(f, settings->num_hashes); + hashwrite_be32(f, settings->bits_per_entry); + + while (list < last) { + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + display_progress(progress, ++i); + hashwrite(f, filter->data, filter->len * sizeof(unsigned char)); + list++; + } + + stop_progress(&progress); +} + static int oid_compare(const void *_a, const void *_b) { const struct object_id *a = (const struct object_id *)_a; @@ -1438,6 +1528,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) struct strbuf progress_title = STRBUF_INIT; int num_chunks = 3; struct object_id file_hash; + const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS; if (ctx->split) { struct strbuf tmp_file = STRBUF_INIT; @@ -1482,6 +1573,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES; num_chunks++; } + if (ctx->changed_paths) { + chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES; + num_chunks++; + chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA; + num_chunks++; + } if (ctx->num_commit_graphs_after > 1) { chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE; num_chunks++; @@ -1500,6 +1597,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) 4 * ctx->num_extra_edges; num_chunks++; } + if (ctx->changed_paths) { + chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + + sizeof(uint32_t) * ctx->commits.nr; + num_chunks++; + + chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + + sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size; + num_chunks++; + } if (ctx->num_commit_graphs_after > 1) { chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + hashsz * (ctx->num_commit_graphs_after - 1); @@ -1537,6 +1643,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) write_graph_chunk_data(f, hashsz, ctx); if (ctx->num_extra_edges) write_graph_chunk_extra_edges(f, ctx); + if (ctx->changed_paths) { + write_graph_chunk_bloom_indexes(f, ctx); + write_graph_chunk_bloom_data(f, ctx, &bloom_settings); + } if (ctx->num_commit_graphs_after > 1 && write_graph_chunk_base(f, ctx)) { return -1; @@ -2184,6 +2294,7 @@ void free_commit_graph(struct commit_graph *g) close(g->graph_fd); } free(g->filename); + free(g->bloom_filter_settings); free(g); } diff --git a/commit-graph.h b/commit-graph.h index 86be81219da..8e7a8e0e5b2 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -11,6 +11,7 @@ #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" struct commit; +struct bloom_filter_settings; char *get_commit_graph_filename(struct object_directory *odb); int open_commit_graph(const char *graph_file, int *fd, struct stat *st); @@ -59,6 +60,10 @@ struct commit_graph { const unsigned char *chunk_commit_data; const unsigned char *chunk_extra_edges; const unsigned char *chunk_base_graphs; + const unsigned char *chunk_bloom_indexes; + const unsigned char *chunk_bloom_data; + + struct bloom_filter_settings *bloom_filter_settings; }; struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st, From patchwork Mon Apr 6 16:59:50 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475917 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8C3CB1392 for ; Mon, 6 Apr 2020 17:00:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 5E889233EB for ; Mon, 6 Apr 2020 17:00:09 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="pIhbfxoZ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729753AbgDFRAI (ORCPT ); Mon, 6 Apr 2020 13:00:08 -0400 Received: from mail-ed1-f66.google.com ([209.85.208.66]:38520 "EHLO mail-ed1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729731AbgDFRAH (ORCPT ); Mon, 6 Apr 2020 13:00:07 -0400 Received: by mail-ed1-f66.google.com with SMTP id e5so379171edq.5 for ; Mon, 06 Apr 2020 10:00:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=mCODFr8ve01DYmbjK4Fo96EeGpzXWBRbaONvoDMgaPc=; b=pIhbfxoZPxBcEUz5EIbb02gjXhmnqeG5T1wG+L9ulk34q1GbkzwpLks26sXMyxzRyB 231L3n39nA62LTFDMPRZcQb3/5Ma1Lv6aWHo2PZsc2InDe7dVWOyrgVLtXZKcN/mG7Ot D3+OHKe5+9yal1zpQdWrvck5PakcBQQQq2ejSzXUnhaQK0wniqYVQ6Hiw4tXlGE52itu UmVt5g0noq+Z8mpTOAGmUkvDaQzeXSvKwE2SNEw0jhL7AwCEiNPsS1gu0U29Ziyyz74i miDJrHyaZoKUcY70Ng3sttb8P0Ij8Wj3oO3QUXt3VUFuGLpXWWM8URAOU+wv1xBArNck Dk1A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=mCODFr8ve01DYmbjK4Fo96EeGpzXWBRbaONvoDMgaPc=; b=Vevu2zcD8c0xFYO7HlwvKWeu3/7DS3a9sGDNWLZpl7JVXlpwByaQBMBXv1FNd9lYUG 101+Ikf2q6jw7yfbMo5YsqK95MOJuiprZXaen7/sFy7EZLzescTAUSl9dICMHBa3mRHh ywbeHi0avEFhxebuNRC17YCFW6+U5gwFX/WtOusYUMlNzfeKu7OrqBlX9e6+5dK78zy+ 3MAC5LFFYKidoT9Ugf772abPKDMwPUSnNDMSMcazoxK2o++32lbWooeO6Z9iMJr5PtKf UXpJa/Wav+zhuQOj5yWyG6GUQ2ZV7RVpTW4fEkE9nX+S1zge44dJcUSw4NvPVeqObCwa 0S9w== X-Gm-Message-State: AGi0PuYO0Ft2452qYhU/74GhocPDG6zGkvuj149oBXUEzZDP8W8X0YPX N8PFc8/CITLZaomXjrENgzYu4nE1 X-Google-Smtp-Source: APiQypJeowY5aCnRS8Sa64EwGbFIlz3rV1przZa8Xm2pCub/0oxQGP9V0+9+eUdhTvhpXPrT0JATWg== X-Received: by 2002:a17:906:7d4:: with SMTP id m20mr466804ejc.10.1586192404331; Mon, 06 Apr 2020 10:00:04 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id 21sm2958780ejz.10.2020.04.06.10.00.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:03 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:50 +0000 Subject: [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Add logic to a) parse Bloom filter information from the commit graph file and, b) re-use existing Bloom filters. See Documentation/technical/commit-graph-format for the format in which the Bloom filter information is written to the commit graph file. To read Bloom filter for a given commit with lexicographic position 'i' we need to: 1. Read BIDX[i] which essentially gives us the starting index in BDAT for filter of commit i+1. It is essentially the index past the end of the filter of commit i. It is called end_index in the code. 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT for filter of commit i. It is called the start_index in the code. For the first commit, where i = 0, Bloom filter data starts at the beginning, just past the header in the BDAT chunk. Hence, start_index will be 0. 3. The length of the filter will be end_index - start_index, because BIDX[i] gives the cumulative 8-byte words including the ith commit's filter. We toggle whether Bloom filters should be recomputed based on the compute_if_not_present flag. Helped-by: Derrick Stolee Signed-off-by: Garima Singh --- bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++- bloom.h | 4 +++- commit-graph.c | 6 +++--- t/helper/test-bloom.c | 2 +- 4 files changed, 55 insertions(+), 6 deletions(-) diff --git a/bloom.c b/bloom.c index a16eee92331..0f714dd76ae 100644 --- a/bloom.c +++ b/bloom.c @@ -4,6 +4,8 @@ #include "diffcore.h" #include "revision.h" #include "hashmap.h" +#include "commit-graph.h" +#include "commit.h" define_commit_slab(bloom_filter_slab, struct bloom_filter); @@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos) return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1)); } +static int load_bloom_filter_from_graph(struct commit_graph *g, + struct bloom_filter *filter, + struct commit *c) +{ + uint32_t lex_pos, start_index, end_index; + + while (c->graph_pos < g->num_commits_in_base) + g = g->base_graph; + + /* The commit graph commit 'c' lives in doesn't carry bloom filters. */ + if (!g->chunk_bloom_indexes) + return 0; + + lex_pos = c->graph_pos - g->num_commits_in_base; + + end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos); + + if (lex_pos > 0) + start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1)); + else + start_index = 0; + + filter->len = end_index - start_index; + filter->data = (unsigned char *)(g->chunk_bloom_data + + sizeof(unsigned char) * start_index + + BLOOMDATA_CHUNK_HEADER_SIZE); + + return 1; +} + /* * Calculate the murmur3 32-bit hash value for the given data * using the given seed. @@ -127,7 +159,8 @@ void init_bloom_filters(void) } struct bloom_filter *get_bloom_filter(struct repository *r, - struct commit *c) + struct commit *c, + int compute_if_not_present) { struct bloom_filter *filter; struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; @@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r, filter = bloom_filter_slab_at(&bloom_filters, c); + if (!filter->data) { + load_commit_graph_info(r, c); + if (c->graph_pos != COMMIT_NOT_FROM_GRAPH && + r->objects->commit_graph->chunk_bloom_indexes) { + if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c)) + return filter; + else + return NULL; + } + } + + if (filter->data || !compute_if_not_present) + return filter; + repo_diff_setup(r, &diffopt); diffopt.flags.recursive = 1; diffopt.max_changes = max_changes; diff --git a/bloom.h b/bloom.h index 85ab8e9423d..760d7122374 100644 --- a/bloom.h +++ b/bloom.h @@ -32,6 +32,7 @@ struct bloom_filter_settings { #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 } #define BITS_PER_WORD 8 +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t) /* * A bloom_filter struct represents a data segment to @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key, void init_bloom_filters(void); struct bloom_filter *get_bloom_filter(struct repository *r, - struct commit *c); + struct commit *c, + int compute_if_not_present); #endif \ No newline at end of file diff --git a/commit-graph.c b/commit-graph.c index a8b6b5cca5d..77668629e27 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, ctx->commits.nr); while (list < last) { - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); cur_pos += filter->len; display_progress(progress, ++i); hashwrite_be32(f, cur_pos); @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, hashwrite_be32(f, settings->bits_per_entry); while (list < last) { - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); display_progress(progress, ++i); hashwrite(f, filter->data, filter->len * sizeof(unsigned char)); list++; @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx) for (i = 0; i < ctx->commits.nr; i++) { struct commit *c = sorted_commits[i]; - struct bloom_filter *filter = get_bloom_filter(ctx->r, c); + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1); ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len; display_progress(progress, i + 1); } diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c index f18d1b722e1..ce412664ba9 100644 --- a/t/helper/test-bloom.c +++ b/t/helper/test-bloom.c @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid) struct bloom_filter *filter; setup_git_directory(); c = lookup_commit(the_repository, commit_oid); - filter = get_bloom_filter(the_repository, c); + filter = get_bloom_filter(the_repository, c, 1); print_bloom_filter(filter); } From patchwork Mon Apr 6 16:59:51 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475929 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9B0E91392 for ; Mon, 6 Apr 2020 17:00:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 7A4BD23DC1 for ; Mon, 6 Apr 2020 17:00:15 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="fT+5KPSl" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729767AbgDFRAO (ORCPT ); Mon, 6 Apr 2020 13:00:14 -0400 Received: from mail-ed1-f67.google.com ([209.85.208.67]:44548 "EHLO mail-ed1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729755AbgDFRAK (ORCPT ); Mon, 6 Apr 2020 13:00:10 -0400 Received: by mail-ed1-f67.google.com with SMTP id i16so331196edy.11 for ; Mon, 06 Apr 2020 10:00:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=pg2pmkJfOHr9lfHzYyqsIOuMzCf7SUS5JtSNcQKLUIU=; b=fT+5KPSltT6FsVjk+FxWqyxXKVJHOneDR3l0dXeLgVvIslq9DCFk47377bC0o4aY8Z rrx5QFY/bGbhE5cQue3Dv3mIgyMFEf1Ujx2HBZdCCDR4Zqodzi905rlfzcpujPLXbHTm E2rAT5aNaHuxUTv397kl8Yrz1bSHNYEe7/08O1gO2fkUWpbovfvzGmRCIOW7zAXrV0Dx BKTudbv5Wop6X9KUAIIu1gjsuQTyMuZf/1SVidrPU9fzOP/nlKDUc65geOdc5UbN7C7d XWfutEVZQtbd6NJ9QlT1ztmnMyjt0YIp5rQSCfcEnLXAnBBOgNQh0uyHTkLTeLSYMDa2 ckRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=pg2pmkJfOHr9lfHzYyqsIOuMzCf7SUS5JtSNcQKLUIU=; b=OqqHJxZ0bSsYxXz+naZGXiNmXigCHcAkh4vlZWkKMbc88S3mU10w+ARv2tQY097m+b o7a3FgkxfWR+I2mUvFRRdR69joYEezaeFAujjmyp0P1+GC1R0CZkWod11wePvLLO6Vjq cvoOVD1Kjtki4e/tuLQ0MwxXJ32Pxx9MDARcpm6qTkGyxnil5MxdV8785A+qK1AUNmnT C94SE0et05LmhJGf2mi1DIj/iPxivtt9BSxeT6Co7rtDBDN+zJeknoIbAwJyUdc0UGXV bLUenmKzecGvccQ3+NR7iG5OdADmgsX8OM00kuHM0tP69s6SV8aZb0zEldRsBrkuOgse xBgg== X-Gm-Message-State: AGi0PuZ1nnhBorl4SPiKc1vkz4uaI95FRLapik4AOFZZyAqfsp8DrKHg mj2u9fEJuSokAIGLyVwOqKFhKq8j X-Google-Smtp-Source: APiQypLZ6CT/TZay+MXZNxaeHYFtTn63QitOfWWoimMfXViCm2YkwrAcb4xi73QHj8gLu/56M98TPA== X-Received: by 2002:a50:d987:: with SMTP id w7mr19731582edj.276.1586192405227; Mon, 06 Apr 2020 10:00:05 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id p4sm2946621eju.57.2020.04.06.10.00.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:04 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:51 +0000 Subject: [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Add --changed-paths option to git commit-graph write. This option will allow users to compute information about the paths that have changed between a commit and its first parent, and write it into the commit graph file. If the option is passed to the write subcommand we set the COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the commit-graph logic. Helped-by: Derrick Stolee Signed-off-by: Garima Singh --- Documentation/git-commit-graph.txt | 5 +++++ builtin/commit-graph.c | 9 +++++++-- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 28d1fee5053..f4b13c005b8 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -57,6 +57,11 @@ or `--stdin-packs`.) With the `--append` option, include all commits that are present in the existing commit-graph file. + +With the `--changed-paths` option, compute and write information about the +paths changed between a commit and it's first parent. This operation can +take a while on large repositories. It provides significant performance gains +for getting history of a directory or a file with `git log -- `. ++ With the `--split` option, write the commit-graph as a chain of multiple commit-graph files stored in `/info/commit-graphs`. The new commits not already in the commit-graph are added in a new "tip" file. This file diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index d1ab6625f63..cacb5d04a80 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph verify [--object-dir ] [--shallow] [--[no-]progress]"), - N_("git commit-graph write [--object-dir ] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] "), + N_("git commit-graph write [--object-dir ] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] "), NULL }; @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] "), + N_("git commit-graph write [--object-dir ] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] "), NULL }; @@ -32,6 +32,7 @@ static struct opts_commit_graph { int split; int shallow; int progress; + int enable_changed_paths; } opts; static struct object_directory *find_odb(struct repository *r, @@ -135,6 +136,8 @@ static int graph_write(int argc, const char **argv) N_("start walk at commits listed by stdin")), OPT_BOOL(0, "append", &opts.append, N_("include all commits already in the commit-graph file")), + OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths, + N_("enable computation for changed paths")), OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")), OPT_BOOL(0, "split", &opts.split, N_("allow writing an incremental commit-graph file")), @@ -168,6 +171,8 @@ static int graph_write(int argc, const char **argv) flags |= COMMIT_GRAPH_WRITE_SPLIT; if (opts.progress) flags |= COMMIT_GRAPH_WRITE_PROGRESS; + if (opts.enable_changed_paths) + flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS; read_replace_refs = 0; odb = find_odb(the_repository, opts.obj_dir); From patchwork Mon Apr 6 16:59:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475927 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A32441392 for ; Mon, 6 Apr 2020 17:00:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 75CAB23BCF for ; Mon, 6 Apr 2020 17:00:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="NxHYhA4Z" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729765AbgDFRAN (ORCPT ); Mon, 6 Apr 2020 13:00:13 -0400 Received: from mail-ed1-f44.google.com ([209.85.208.44]:34299 "EHLO mail-ed1-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729751AbgDFRAL (ORCPT ); Mon, 6 Apr 2020 13:00:11 -0400 Received: by mail-ed1-f44.google.com with SMTP id o1so413182edv.1 for ; Mon, 06 Apr 2020 10:00:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=UMGJmkbsrjpxKULyJuJ+crf4sLy929KNZgSw5qQfemc=; b=NxHYhA4ZmXhXKA9pgExOz6B/CPXjwStfS+A5t8BHas87rcbLLOPpBO1l6rW1DiVB0n tbmZzKgBhl4JsBL5lemNqM5gzVHIPAJgZtSXoDSqX8PU3NpIjSzxTnwGN7T/ICPP+a30 FKw5JpLiGwXeF3C5Fou+nFxnBSm3ZdAH2gACISbANGG2e5zhPgyqDAP/UCj2Ez0Z4UKm Z8S8trKgDR3ElxeexeX2cjfy+Xlj0D62R124YwcDRjjyjxEgt9HyFufsYH6yf3FtPNPG wM469wAHUav9npzbQN/uBROpVYDDbvWZwke+UH3UOKPcIXdWp6mEl9nUAzfbTLBH2OeE CFjQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=UMGJmkbsrjpxKULyJuJ+crf4sLy929KNZgSw5qQfemc=; b=EzayQer6Ppp/BVf9vrUGU/ZQTqCezHKedqQ+GQWwHr64wqUEKkFMS15Vr0jmKGyiT5 I2qjLmbmsj/na5A/7G1/bLxJPD+HekEnPwzEcTZAiNICwcoWYvfNgOLE/Chvejw8CLN8 POMosAtxsFmyA2VBo9kqM4JJVrWy203GA2ap+HLxI//eJ3vESiU59SvU/qmqwZdg68IM WVO7wzp0DTWP/0cxonsnHQ4FSnO3cLDYqjQyRFVdWTniOyp1TZpOt+eRuqk3i5S7pSIn SB4yFnSmSdczjWJgkQSuVMCwIpgYClB1FlkkS7zKadvrZ1Ddbnb2L4JRaDOHfon+2fzJ MsZw== X-Gm-Message-State: AGi0PuYPXNMcglxoU9y5EexB+GFDZHytxvN6bb1n6G0WsxgwRvX0TRne eFBxzyon/y/JFIScZHCbbg6C7f81 X-Google-Smtp-Source: APiQypK+v9JajQnqiZSIx/e7Pskk4ckqDbjHvtSWywsei/lMJB42HN0VMwaGPQMt9oGlfjNjquZpOA== X-Received: by 2002:a17:906:8689:: with SMTP id g9mr398413ejx.345.1586192406038; Mon, 06 Apr 2020 10:00:06 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id p25sm3053001ejw.49.2020.04.06.10.00.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:05 -0700 (PDT) Message-Id: <617f549ef259424658a84dd67a98685328f6b850.1586192395.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:52 +0000 Subject: [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Revision walk will now use Bloom filters for commits to speed up revision walks for a particular path (for computing history for that path), if they are present in the commit-graph file. We load the Bloom filters during the prepare_revision_walk step, currently only when dealing with a single pathspec. Extending it to work with multiple pathspecs can be explored and built on top of this series in the future. While comparing trees in rev_compare_trees(), if the Bloom filter says that the file is not different between the two trees, we don't need to compute the expensive diff. This is where we get our performance gains. The other response of the Bloom filter is '`:maybe', in which case we fall back to the full diff calculation to determine if the path was changed in the commit. We do not try to use Bloom filters when the '--walk-reflogs' option is specified. The '--walk-reflogs' option does not walk the commit ancestry chain like the rest of the options. Incorporating the performance gains when walking reflog entries would add more complexity, and can be explored in a later series. Performance Gains: We tested the performance of `git log -- ` on the git repo, the linux and some internal large repos, with a variety of paths of varying depths. On the git and linux repos: - we observed a 2x to 5x speed up. On a large internal repo with files seated 6-10 levels deep in the tree: - we observed 10x to 20x speed ups, with some paths going up to 28 times faster. Helped-by: Derrick Stolee Helped-by: Jonathan Tan Signed-off-by: Garima Singh --- bloom.c | 20 +++++++++++++ bloom.h | 4 +++ revision.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-- revision.h | 11 +++++++ 4 files changed, 118 insertions(+), 2 deletions(-) diff --git a/bloom.c b/bloom.c index 0f714dd76ae..c5b461d1cfe 100644 --- a/bloom.c +++ b/bloom.c @@ -253,3 +253,23 @@ struct bloom_filter *get_bloom_filter(struct repository *r, return filter; } + +int bloom_filter_contains(const struct bloom_filter *filter, + const struct bloom_key *key, + const struct bloom_filter_settings *settings) +{ + int i; + uint64_t mod = filter->len * BITS_PER_WORD; + + if (!mod) + return -1; + + for (i = 0; i < settings->num_hashes; i++) { + uint64_t hash_mod = key->hashes[i] % mod; + uint64_t block_pos = hash_mod / BITS_PER_WORD; + if (!(filter->data[block_pos] & get_bitmask(hash_mod))) + return 0; + } + + return 1; +} \ No newline at end of file diff --git a/bloom.h b/bloom.h index 760d7122374..b935186425d 100644 --- a/bloom.h +++ b/bloom.h @@ -83,4 +83,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c, int compute_if_not_present); +int bloom_filter_contains(const struct bloom_filter *filter, + const struct bloom_key *key, + const struct bloom_filter_settings *settings); + #endif \ No newline at end of file diff --git a/revision.c b/revision.c index 8136929e236..d3fcb7c6ff6 100644 --- a/revision.c +++ b/revision.c @@ -29,6 +29,7 @@ #include "prio-queue.h" #include "hashmap.h" #include "utf8.h" +#include "bloom.h" volatile show_early_output_fn_t show_early_output; @@ -624,11 +625,80 @@ static void file_change(struct diff_options *options, options->flags.has_changes = 1; } +static void prepare_to_use_bloom_filter(struct rev_info *revs) +{ + struct pathspec_item *pi; + char *path_alloc = NULL; + const char *path; + int last_index; + int len; + + if (!revs->commits) + return; + + repo_parse_commit(revs->repo, revs->commits->item); + + if (!revs->repo->objects->commit_graph) + return; + + revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings; + if (!revs->bloom_filter_settings) + return; + + pi = &revs->pruning.pathspec.items[0]; + last_index = pi->len - 1; + + /* remove single trailing slash from path, if needed */ + if (pi->match[last_index] == '/') { + path_alloc = xstrdup(pi->match); + path_alloc[last_index] = '\0'; + path = path_alloc; + } else + path = pi->match; + + len = strlen(path); + + revs->bloom_key = xmalloc(sizeof(struct bloom_key)); + fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); + + free(path_alloc); +} + +static int check_maybe_different_in_bloom_filter(struct rev_info *revs, + struct commit *commit) +{ + struct bloom_filter *filter; + int result; + + if (!revs->repo->objects->commit_graph) + return -1; + + if (commit->generation == GENERATION_NUMBER_INFINITY) + return -1; + + filter = get_bloom_filter(revs->repo, commit, 0); + + if (!filter) { + return -1; + } + + if (!filter->len) { + return -1; + } + + result = bloom_filter_contains(filter, + revs->bloom_key, + revs->bloom_filter_settings); + + return result; +} + static int rev_compare_tree(struct rev_info *revs, - struct commit *parent, struct commit *commit) + struct commit *parent, struct commit *commit, int nth_parent) { struct tree *t1 = get_commit_tree(parent); struct tree *t2 = get_commit_tree(commit); + int bloom_ret = 1; if (!t1) return REV_TREE_NEW; @@ -653,11 +723,19 @@ static int rev_compare_tree(struct rev_info *revs, return REV_TREE_SAME; } + if (revs->bloom_key && !nth_parent) { + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); + + if (bloom_ret == 0) + return REV_TREE_SAME; + } + tree_difference = REV_TREE_SAME; revs->pruning.flags.has_changes = 0; if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning) < 0) return REV_TREE_DIFFERENT; + return tree_difference; } @@ -855,7 +933,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit) die("cannot simplify commit %s (because of %s)", oid_to_hex(&commit->object.oid), oid_to_hex(&p->object.oid)); - switch (rev_compare_tree(revs, p, commit)) { + switch (rev_compare_tree(revs, p, commit, nth_parent)) { case REV_TREE_SAME: if (!revs->simplify_history || !relevant_commit(p)) { /* Even if a merge with an uninteresting @@ -3362,6 +3440,8 @@ int prepare_revision_walk(struct rev_info *revs) FOR_EACH_OBJECT_PROMISOR_ONLY); } + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info) + prepare_to_use_bloom_filter(revs); if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED) commit_list_sort_by_date(&revs->commits); if (revs->no_walk) @@ -3379,6 +3459,7 @@ int prepare_revision_walk(struct rev_info *revs) simplify_merges(revs); if (revs->children.name) set_children(revs); + return 0; } diff --git a/revision.h b/revision.h index 475f048fb61..7c026fe41fc 100644 --- a/revision.h +++ b/revision.h @@ -56,6 +56,8 @@ struct repository; struct rev_info; struct string_list; struct saved_parents; +struct bloom_key; +struct bloom_filter_settings; define_shared_commit_slab(revision_sources, char *); struct rev_cmdline_info { @@ -291,6 +293,15 @@ struct rev_info { struct revision_sources *sources; struct topo_walk_info *topo_walk_info; + + /* Commit graph bloom filter fields */ + /* The bloom filter key for the pathspec */ + struct bloom_key *bloom_key; + /* + * The bloom filter settings used to generate the key. + * This is loaded from the commit-graph being used. + */ + struct bloom_filter_settings *bloom_filter_settings; }; int ref_excluded(struct string_list *, const char *path); From patchwork Mon Apr 6 16:59:53 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475923 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6B80F912 for ; Mon, 6 Apr 2020 17:00:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 49C4D233EB for ; Mon, 6 Apr 2020 17:00:12 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="CrcxUuwY" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729761AbgDFRAL (ORCPT ); Mon, 6 Apr 2020 13:00:11 -0400 Received: from mail-ed1-f68.google.com ([209.85.208.68]:38532 "EHLO mail-ed1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729534AbgDFRAI (ORCPT ); Mon, 6 Apr 2020 13:00:08 -0400 Received: by mail-ed1-f68.google.com with SMTP id e5so379343edq.5 for ; Mon, 06 Apr 2020 10:00:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=l2hH/TavINoYF3ZV0UYu/koCVh28c/8hMEtEzNLktok=; b=CrcxUuwY3cf+hU4l98DuBjnAi5gp1SrSjez/p93iyFnQAQeQENmWl+tc8PPHrjFq9/ 8OWKpRs1sCXVqiRbgXF33xy9I6oiShweFeikwhGZ32XkdOHUMM+JU84w8e7EGjVj9pnD W2OvyK/uVrpxZnNWhw9nLckvYC5uz25o7qKRROk9drlYMhMXBfkBxnPdM/Ge36zpH718 jUV7Jb4HBRDbgXdvzFcMBX2T9E4hllub/EYUs8CfqdK68hv9j8timjC2OsiNDxpiZiqk THHlGffWQhE4EjoZ13XiJ/BsX5W7WzO/IwU1JFNzP3qyl/r7CPGPHveSwVCIHB8JNo3D CfZw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=l2hH/TavINoYF3ZV0UYu/koCVh28c/8hMEtEzNLktok=; b=rCXmNgGJBXdDkaLJygxLdqOdftHTF/AtrUDGk33giC/pbz1bjECxI0L2pY2CDkNHYl I51/s876+FFhgSLkpnkWihsCxhEaYfu1DnMFI0bohDFjUA71RUzXN2+XVqZFlqsqrq+L ue00NwfqJ6xmf3S8SnzAroLErTKKD8MPlZgYb8KAVb3IF6aAMPo2sLLMm19gBKn/bcK6 roqtlPzsUnl8R6JG7IxDJNwlPX3/cu5Zh1DczgLHCuZmcxaMA/GFbi0wkbaUyO579VhP QGq+/7XyhbwBe/4Iyur+KMOsuuFnhiYp/Dxh1fhwN7tFRWq+IlN0bCmN5a77hGNa6o10 BP8w== X-Gm-Message-State: AGi0PubWyT5WKHk2Sh/WQZIRLjMcD++8kCMO/VwTKxbMikreGK7uplSZ 8PurTPLT1jsJZZtWSHx1MIxOtae8 X-Google-Smtp-Source: APiQypJk+7aazjvYp5C0ToWMQTHS0n5rFKg/Ap78Bl/W381ZLoadFSJJtPtnKpnHsp2rz2cVDqvTQQ== X-Received: by 2002:a50:eb08:: with SMTP id y8mr19198984edp.49.1586192406730; Mon, 06 Apr 2020 10:00:06 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id d13sm1935869ejt.74.2020.04.06.10.00.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:06 -0700 (PDT) Message-Id: <6beaede715972d7726dda8a58bbd4b920813e194.1586192395.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:53 +0000 Subject: [PATCH v4 13/15] revision.c: add trace2 stats around Bloom filter usage MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Add trace2 statistics around Bloom filter usage and behavior for 'git log -- path' commands that are hoping to benefit from the presence of computed changed paths Bloom filters. These statistics are great for performance analysis work and for formal testing, which we will see in the commit following this one. Helped-by: Derrick Stolee Helped-by: Jonathan Tan Signed-off-by: Garima Singh --- revision.c | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/revision.c b/revision.c index d3fcb7c6ff6..2b06ee739c8 100644 --- a/revision.c +++ b/revision.c @@ -30,6 +30,7 @@ #include "hashmap.h" #include "utf8.h" #include "bloom.h" +#include "json-writer.h" volatile show_early_output_fn_t show_early_output; @@ -625,6 +626,30 @@ static void file_change(struct diff_options *options, options->flags.has_changes = 1; } +static int bloom_filter_atexit_registered; +static unsigned int count_bloom_filter_maybe; +static unsigned int count_bloom_filter_definitely_not; +static unsigned int count_bloom_filter_false_positive; +static unsigned int count_bloom_filter_not_present; +static unsigned int count_bloom_filter_length_zero; + +static void trace2_bloom_filter_statistics_atexit(void) +{ + struct json_writer jw = JSON_WRITER_INIT; + + jw_object_begin(&jw, 0); + jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present); + jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero); + jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe); + jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not); + jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive); + jw_end(&jw); + + trace2_data_json("bloom", the_repository, "statistics", &jw); + + jw_release(&jw); +} + static void prepare_to_use_bloom_filter(struct rev_info *revs) { struct pathspec_item *pi; @@ -661,6 +686,11 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) revs->bloom_key = xmalloc(sizeof(struct bloom_key)); fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); + if (trace2_is_enabled() && !bloom_filter_atexit_registered) { + atexit(trace2_bloom_filter_statistics_atexit); + bloom_filter_atexit_registered = 1; + } + free(path_alloc); } @@ -679,10 +709,12 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, filter = get_bloom_filter(revs->repo, commit, 0); if (!filter) { + count_bloom_filter_not_present++; return -1; } if (!filter->len) { + count_bloom_filter_length_zero++; return -1; } @@ -690,6 +722,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, revs->bloom_key, revs->bloom_filter_settings); + if (result) + count_bloom_filter_maybe++; + else + count_bloom_filter_definitely_not++; + return result; } @@ -736,6 +773,10 @@ static int rev_compare_tree(struct rev_info *revs, &revs->pruning) < 0) return REV_TREE_DIFFERENT; + if (!nth_parent) + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME) + count_bloom_filter_false_positive++; + return tree_difference; } From patchwork Mon Apr 6 16:59:54 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475925 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E20A61871 for ; Mon, 6 Apr 2020 17:00:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B44A2224F9 for ; Mon, 6 Apr 2020 17:00:12 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="caYsfObo" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729759AbgDFRAL (ORCPT ); Mon, 6 Apr 2020 13:00:11 -0400 Received: from mail-ed1-f54.google.com ([209.85.208.54]:39001 "EHLO mail-ed1-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729521AbgDFRAK (ORCPT ); Mon, 6 Apr 2020 13:00:10 -0400 Received: by mail-ed1-f54.google.com with SMTP id a43so371692edf.6 for ; Mon, 06 Apr 2020 10:00:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=ZsHUaVtS2qOnoUjyOO9pvlAfuFOOP8BOHJQ/886lGO4=; b=caYsfObolnfBGl7vRcnhVWquzT//5ZJmkdE84naTwoBl738o+785hqVNeZc31sseah nD25ZER10wANPQkuAI0Iao/DnibAkXoNy9MSErF1GcgAr5ZJWEBCwLLW3loAVf0pmGYf yiHoMKcOhXRNym8SaH3jEXXtX1+MYTQo6k13MjWXI9qIzAC8h14MwOrq+5pJMMRVghjr Xg7vvxqxCt8k5hbvmMolLnIn98aJamW3EioLEANnVLPGZ2vKidr3G8YCUL0yxVoQQUBn 494tEQbpYwhStDXS9k5rkanukKsP+hDVrBAmG4e+RTphjhscaadkcJl2bmiiz/UtaCn0 z4GQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=ZsHUaVtS2qOnoUjyOO9pvlAfuFOOP8BOHJQ/886lGO4=; b=i6ZmTTmngJXlqTUalGZWrjZXKU5S5jA6A6T2nm86h8f71W/6BmkB6H605T/TkREg2V pe/E+4UAOy79JSacXlIweBiJe1VGHtzEHADz0dmT4j/L8RzUgv/zcGv6UTmePBHIZ4b7 +JD2n5nTl69v1ZdPv7GFNu5F8jDvKjoNMOP2xz86nIFxI5SwnXz5JORXiA52owNVWHlB 3V7DTqi4kdosE27fD6C+8tRbzRETn0et3Zyo/eX3mdsQoL2wQwoKADzF1lRsSNintdl1 NtodUpB0H4wYuqXtruLYsuhtWeWvukTjlsQ8MfYDNb+8Aep+7AvP1jPbiIDA0IJd955K 8ywg== X-Gm-Message-State: AGi0PubrxSTwuqsar86MPPHn/7WptSHcDvBSoISTSql4Xlxt+Et/DYO2 x0n0FCr5dtk/9X5rm/7pb78YgNm4 X-Google-Smtp-Source: APiQypLVXNlbMOAVh3Q2Z/fq158wY84DLSbBeB7blpxEye7H71AStZfsDxZPyl5UGMLr77jNC+Q6YA== X-Received: by 2002:a05:6402:1655:: with SMTP id s21mr20033695edx.295.1586192407497; Mon, 06 Apr 2020 10:00:07 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id b15sm2486178edn.69.2020.04.06.10.00.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:07 -0700 (PDT) Message-Id: In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:54 +0000 Subject: [PATCH v4 14/15] t4216: add end to end tests for git log with Bloom filters Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh These tests exercises writing commit graph with Bloom filters and exercises 'git log -- path' with all the applicable options. They check that the output is the same with and without Bloom filters, confirm Bloom filters were used by checking if trace2 statistics were logged correctly. Also confirms cases where Bloom filters are not used: 1. Multiple path specs, 2. --walk-reflogs (see patch titled 'revision.c: use Bloom filters...' for details, 3. If the latest commit graph does not have Bloom filters Signed-off-by: Garima Singh --- t/helper/test-read-graph.c | 4 + t/t4216-log-bloom.sh | 155 +++++++++++++++++++++++++++++++++++++ 2 files changed, 159 insertions(+) create mode 100755 t/t4216-log-bloom.sh diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c index f8a461767ca..4223ff32fb6 100644 --- a/t/helper/test-read-graph.c +++ b/t/helper/test-read-graph.c @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv) printf(" commit_metadata"); if (graph->chunk_extra_edges) printf(" extra_edges"); + if (graph->chunk_bloom_indexes) + printf(" bloom_indexes"); + if (graph->chunk_bloom_data) + printf(" bloom_data"); printf("\n"); UNLEAK(graph); diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh new file mode 100755 index 00000000000..38accd272df --- /dev/null +++ b/t/t4216-log-bloom.sh @@ -0,0 +1,155 @@ +#!/bin/sh + +test_description='git log for a path with Bloom filters' +. ./test-lib.sh + +GIT_TEST_COMMIT_GRAPH=0 +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 + +test_expect_success 'setup test - repo, commits, commit graph, log outputs' ' + git init && + mkdir A A/B A/B/C && + test_commit c1 A/file1 && + test_commit c2 A/B/file2 && + test_commit c3 A/B/C/file3 && + test_commit c4 A/file1 && + test_commit c5 A/B/file2 && + test_commit c6 A/B/C/file3 && + test_commit c7 A/file1 && + test_commit c8 A/B/file2 && + test_commit c9 A/B/C/file3 && + test_commit c10 file_to_be_deleted && + git checkout -b side HEAD~4 && + test_commit side-1 file4 && + git checkout master && + git merge side && + test_commit c11 file5 && + mv file5 file5_renamed && + git add file5_renamed && + git commit -m "rename" && + rm file_to_be_deleted && + git add . && + git commit -m "file removed" && + git commit-graph write --reachable --changed-paths +' +graph_read_expect () { + NUM_CHUNKS=5 + cat >expect <<- EOF + header: 43475048 1 1 $NUM_CHUNKS 0 + num_commits: $1 + chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data + EOF + test-tool read-graph >actual && + test_cmp expect actual +} + +test_expect_success 'commit-graph write wrote out the bloom chunks' ' + graph_read_expect 15 +' + +# Turn off any inherited trace2 settings for this test. +sane_unset GIT_TRACE2 GIT_TRACE2_PERF GIT_TRACE2_EVENT +sane_unset GIT_TRACE2_PERF_BRIEF +sane_unset GIT_TRACE2_CONFIG_PARAMS + +setup () { + rm "$TRASH_DIRECTORY/trace.perf" + git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom && + GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom +} + +test_bloom_filters_used () { + log_args=$1 + bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\"" + setup "$log_args" && + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && + test_cmp log_wo_bloom log_w_bloom && + test_path_is_file "$TRASH_DIRECTORY/trace.perf" +} + +test_bloom_filters_not_used () { + log_args=$1 + setup "$log_args" && + !(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && + test_cmp log_wo_bloom log_w_bloom +} + +for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5 file5_renamed file_to_be_deleted +do + for option in "" \ + "--all" \ + "--full-history" \ + "--full-history --simplify-merges" \ + "--simplify-merges" \ + "--simplify-by-decoration" \ + "--follow" \ + "--first-parent" \ + "--topo-order" \ + "--date-order" \ + "--author-date-order" \ + "--ancestry-path side..master" + do + test_expect_success "git log option: $option for path: $path" ' + test_bloom_filters_used "$option -- $path" + ' + done +done + +test_expect_success 'git log -- folder works with and without the trailing slash' ' + test_bloom_filters_used "-- A" && + test_bloom_filters_used "-- A/" +' + +test_expect_success 'git log for path that does not exist. ' ' + test_bloom_filters_used "-- path_does_not_exist" +' + +test_expect_success 'git log with --walk-reflogs does not use Bloom filters' ' + test_bloom_filters_not_used "--walk-reflogs -- A" +' + +test_expect_success 'git log -- multiple path specs does not use Bloom filters' ' + test_bloom_filters_not_used "-- file4 A/file1" +' + +test_expect_success 'git log with wildcard that resolves to a single path uses Bloom filters' ' + test_bloom_filters_used "-- *4" && + test_bloom_filters_used "-- *renamed" +' + +test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' ' + test_bloom_filters_not_used "-- *" && + test_bloom_filters_not_used "-- file*" +' + +test_expect_success 'setup - add commit-graph to the chain without Bloom filters' ' + test_commit c14 A/anotherFile2 && + test_commit c15 A/B/anotherFile2 && + test_commit c16 A/B/C/anotherFile2 && + GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split && + test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain +' + +test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' ' + test_bloom_filters_not_used "-- A/B" +' + +test_expect_success 'setup - add commit-graph to the chain with Bloom filters' ' + test_commit c17 A/anotherFile3 && + git commit-graph write --reachable --changed-paths --split && + test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain +' + +test_bloom_filters_used_when_some_filters_are_missing () { + log_args=$1 + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6" + setup "$log_args" && + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && + test_cmp log_wo_bloom log_w_bloom +} + +test_expect_success 'Use Bloom filters if they exist in the latest but not all commit graphs in the chain.' ' + test_bloom_filters_used_when_some_filters_are_missing "-- A/B" +' + +test_done \ No newline at end of file From patchwork Mon Apr 6 16:59:55 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Arver via GitGitGadget X-Patchwork-Id: 11475931 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4EE321892 for ; Mon, 6 Apr 2020 17:00:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 2CC4623BCF for ; Mon, 6 Apr 2020 17:00:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="rCsV2UIo" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729763AbgDFRAN (ORCPT ); Mon, 6 Apr 2020 13:00:13 -0400 Received: from mail-ed1-f66.google.com ([209.85.208.66]:46981 "EHLO mail-ed1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729731AbgDFRAK (ORCPT ); Mon, 6 Apr 2020 13:00:10 -0400 Received: by mail-ed1-f66.google.com with SMTP id cf14so315425edb.13 for ; Mon, 06 Apr 2020 10:00:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=dNfQ6jVSrkUzYV6pOHZOfey72aL6gNJTsqpvfb9xUaA=; b=rCsV2UIo1eEySvhUiPGAQ9lIutLHS/NJ+vaHTuEoAlcOdtpAZmWtrfsHxFFcEc5brX WZfJFZr98uCo1hExyrrsuNc93iwnF7rkVQ9M58XGNB7lPixKqMMqaMtxdabEB4deNmw7 4w3ldGOJCrFD1uHRL1hh8b1UZ1KqHWEKAlZin3KUNypYiEacOWXftufnKerKEFwKQIJl ATUVqICesKZdeNhH7WQwPZEQrlupkyh2TVwbkxQhyCPKID6b3qTIMZ0556tUJRV/C1rf yCq8sVYi6TAtfORl/zynjny0G66KmULYfHXdUuTUdkZeJiLmhiOGT2V6wV91uRRT3xa3 r72w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=dNfQ6jVSrkUzYV6pOHZOfey72aL6gNJTsqpvfb9xUaA=; b=rLK4IEZOCbs/67+NbRMg2LcnQHDiSG+kj4SacvPCdY2OSULh/udUygMxZuL+sXr8yR t4FV9YeJk2moO/SMlCGyxoagd0//E7YXdGKrN4/7qNjs/TpFsQIgTMtFm7T+imqqxxkk Ccny17oZdZvtqbe9FDyv5C7r1BOhwoHrynIBExF45fxJeUh2fEW3OradmP38qddiiihH YHLpib4VY3+OrJi6WzWNY+Sj8YjkrhKEAShdgyYup9lggcizc61FwS/qtgti9tRUsSyo QlQiIxeVA5CmDUmXb+zxw213OAMP0KLXUemECY+U2gwh/YlaHnmWHuvAFT/RKruuua4q 3gdw== X-Gm-Message-State: AGi0Puatj6XiqE0I4/7wo4OcRWs/Biz2H+cX0QgAlOmUehxcMmSOHaWv ftvYEGA55hs0tXRCT9P0zKad1eK+ X-Google-Smtp-Source: APiQypI+XvGx893r2aoP1mPVrf5kmpAPFkSsc6hIodRQRkM/GAkMYJUBHK1VW7xy+gg99EcX5TWFLQ== X-Received: by 2002:a17:906:493:: with SMTP id f19mr399898eja.171.1586192408211; Mon, 06 Apr 2020 10:00:08 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id r9sm2977449ejb.8.2020.04.06.10.00.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Apr 2020 10:00:07 -0700 (PDT) Message-Id: <5656e8590e9d4b42800e9ce53b10c45c63f03a1a.1586192395.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Garima Singh via GitGitGadget" Date: Mon, 06 Apr 2020 16:59:55 +0000 Subject: [PATCH v4 15/15] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, Garima Singh , Garima Singh Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Garima Singh Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite in order to toggle writing Bloom filters when running any of the git tests. If set to true, we will compute and write Bloom filters every time a test calls `git commit-graph write`, as if the `--changed-paths` option was passed in. The test suite passes when GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled. Helped-by: Derrick Stolee Signed-off-by: Garima Singh --- builtin/commit-graph.c | 3 ++- ci/run-build-and-tests.sh | 1 + commit-graph.h | 1 + t/README | 5 +++++ t/t5318-commit-graph.sh | 2 ++ t/t5324-split-commit-graph.sh | 1 + 6 files changed, 12 insertions(+), 1 deletion(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index cacb5d04a80..59009837dc9 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -171,7 +171,8 @@ static int graph_write(int argc, const char **argv) flags |= COMMIT_GRAPH_WRITE_SPLIT; if (opts.progress) flags |= COMMIT_GRAPH_WRITE_PROGRESS; - if (opts.enable_changed_paths) + if (opts.enable_changed_paths || + git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0)) flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS; read_replace_refs = 0; diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh index 4df54c4efea..17e25aade96 100755 --- a/ci/run-build-and-tests.sh +++ b/ci/run-build-and-tests.sh @@ -19,6 +19,7 @@ linux-gcc) export GIT_TEST_OE_SIZE=10 export GIT_TEST_OE_DELTA_SIZE=5 export GIT_TEST_COMMIT_GRAPH=1 + export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1 export GIT_TEST_MULTI_PACK_INDEX=1 export GIT_TEST_ADD_I_USE_BUILTIN=1 make test diff --git a/commit-graph.h b/commit-graph.h index 8e7a8e0e5b2..8655d064c14 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -9,6 +9,7 @@ #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH" #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" +#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS" struct commit; struct bloom_filter_settings; diff --git a/t/README b/t/README index 369e3a9ded8..4f53da53a15 100644 --- a/t/README +++ b/t/README @@ -378,6 +378,11 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to be written after every 'git commit' command, and overrides the 'core.commitGraph' setting to true. +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=, when true, forces +commit-graph write to compute and write changed path Bloom filters for +every 'git commit-graph write', as if the `--changed-paths` option was +passed in. + GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor code path for utilizing a file system monitor to speed up detecting new or changed files. diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 9bf920ae171..18304a65e4d 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -3,6 +3,8 @@ test_description='commit graph' . ./test-lib.sh +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 + test_expect_success 'setup full repo' ' mkdir full && cd "$TRASH_DIRECTORY/full" && diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh index 53b2e6b4555..d3f1f2c4a71 100755 --- a/t/t5324-split-commit-graph.sh +++ b/t/t5324-split-commit-graph.sh @@ -4,6 +4,7 @@ test_description='split commit graph' . ./test-lib.sh GIT_TEST_COMMIT_GRAPH=0 +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 test_expect_success 'setup repo' ' git init &&