From patchwork Fri May 29 08:50:05 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578109
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2A83A912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:03 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 113F7207F5
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:03 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="ic0DH/Xn"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726441AbgE2IvB (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:01 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44306 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1725865AbgE2Iu7 (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:50:59 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 84AA6C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:50:59 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id s8so2515001wrt.9
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:50:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=vPvBaxVdZRHaGysXicN82lxe6K1U2p0UZtACd6nto4g=;
        b=ic0DH/XnsO4rILAaHKqPeC0OiP/hJFqb2xGQ803zPFaqYmAOamVIHrU1gNQS3YljzT
         p2PBL6xC7Im+jbGWsaezFLwk+mY+PRWu9WFISXBXoZ1aPtxrsVceARelL7ck1OHyzcTP
         1oJtxd2OvxP4+wZKbRJSNK/Md9BjOnquGMV0wVXUI+Wkij2NShh+qM/bwPLuH8mcrwmn
         bmcL2iRASCVbF3D0i8iKAgOkVAegrP/fO/5Ci8LQU+h6lSn44HyOy5yRdt1vgLkkFg+s
         0esqVEb4X83midS/ixvooz2fsaeyKEkgAeXu73n1sHbdaPTWoZZjq0uV0qMC0Swbl5ov
         642w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=vPvBaxVdZRHaGysXicN82lxe6K1U2p0UZtACd6nto4g=;
        b=ORSYjRAdOXtrsaA+QkeQc316d8BnUwHjGABg59mRpNg641Fd0ORHUGNpCG23Z5jZog
         uSM3WksV5frnwprNnGV6/V2v/1Pahd529ltTjA6GHb1G5Npg4eZF28T8vTVNDcAwtQw+
         gkNjOcapPTM95Ye96KEyYSd6XhnBseLPPFBav859abzSwIvkC9N9wu3el2XrA8CD7+E+
         yq7TofV+/mnlxV2Jg1S5/ahx6Qo7l4TV3CUNGRBe93T4uM2MR9Y2bH9wqHjpAMpFow18
         2kavoBq331HoBLPql0WEAElfFMY4r9MBwtXSgnZYwXT9C6n4Ub4v1ytpOn77D2Osj+s3
         H2DQ==
X-Gm-Message-State: AOAM530Ih3bh9nSTT9cSK8Y+2qupFWrYCkE5JlTJqmS5lD3hzw1qk64T
        Q045BoEmycZ9bFHpd8R5LkxotLXk
X-Google-Smtp-Source: 
 ABdhPJyz3hCVzsTpL1VGFMS9LpmtzOvE3ute/XWca7XzTmywfoQ+0Tm4i1LUahChOnRQYmusGv6Bpw==
X-Received: by 2002:adf:fb0f:: with SMTP id c15mr8138918wrr.410.1590742257942;
        Fri, 29 May 2020 01:50:57 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.50.56
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:50:56 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 01/34] tree-walk.c: don't match submodule entries for
 'submod/anything'
Date: Fri, 29 May 2020 10:50:05 +0200
Message-Id: <20200529085038.26008-2-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Submodules should be handled the same as regular directories with
respect to the presence of a trailing slash, i.e. commands like:

  git diff rev1 rev2 -- $path
  git rev-list HEAD -- $path

should produce the same output whether $path is 'submod' or 'submod/'.
This has been fixed in commit 74b4f7f277 (tree-walk.c: ignore trailing
slash on submodule in tree_entry_interesting(), 2014-01-23).

Unfortunately, that commit had the unintended side effect to handle
'submod/anything' the same as 'submod' and 'submod/' as well, e.g.:

  $ git log --oneline --name-only -- sha1collisiondetection/whatever
  4125f78222 sha1dc: update from upstream
  sha1collisiondetection
  07a20f569b Makefile: fix unaligned loads in sha1dc with UBSan
  sha1collisiondetection
  23e37f8e9d sha1dc: update from upstream
  sha1collisiondetection
  86cfd61e6b sha1dc: optionally use sha1collisiondetection as a submodule
  sha1collisiondetection

Fix this by rejecting submodules as partial pathnames when their
trailing slash is followed by anything.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 t/t4010-diff-pathspec.sh | 4 +++-
 tree-walk.c              | 9 ++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/t/t4010-diff-pathspec.sh b/t/t4010-diff-pathspec.sh
index e5ca359edf..65cc703c65 100755
--- a/t/t4010-diff-pathspec.sh
+++ b/t/t4010-diff-pathspec.sh
@@ -125,7 +125,9 @@ test_expect_success 'setup submodules' '
 test_expect_success 'diff-tree ignores trailing slash on submodule path' '
 	git diff --name-only HEAD^ HEAD submod >expect &&
 	git diff --name-only HEAD^ HEAD submod/ >actual &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+	git diff --name-only HEAD^ HEAD -- submod/whatever >actual &&
+	test_must_be_empty actual
 '
 
 test_expect_success 'diff multiple wildcard pathspecs' '
diff --git a/tree-walk.c b/tree-walk.c
index d5a8e096a6..7fbb150312 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -849,7 +849,14 @@ static int match_entry(const struct pathspec_item *item,
 	if (matchlen > pathlen) {
 		if (match[pathlen] != '/')
 			return 0;
-		if (!S_ISDIR(entry->mode) && !S_ISGITLINK(entry->mode))
+		/*
+		 * Reject non-directories as partial pathnames, except
+		 * when match is a submodule with a trailing slash and
+		 * nothing else (to handle 'submod/' and 'submod'
+		 * uniformly).
+		 */
+		if (!S_ISDIR(entry->mode) &&
+		    (!S_ISGITLINK(entry->mode) || matchlen > pathlen + 1))
 			return 0;
 	}
 

From patchwork Fri May 29 08:50:06 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578111
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4BF91739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:05 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 319B520723
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:05 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="Umxakuz8"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726568AbgE2IvE (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:04 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44312 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726415AbgE2IvB (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:01 -0400
Received: from mail-wm1-x342.google.com (mail-wm1-x342.google.com
 [IPv6:2a00:1450:4864:20::342])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F0B14C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:00 -0700 (PDT)
Received: by mail-wm1-x342.google.com with SMTP id j198so6306951wmj.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=X1qP0yIzBq54g87TwIKpV0SZXbQXFiYLM+euJ7imfDI=;
        b=Umxakuz81juFaz7GejIqZc14icOYWJyeRTlP+MB8yVjPacdu2wuwvGMXGXPxtK5pvi
         UatcJGgEa17PK23qUIzL14qr+HEwarUGWwAT33iSuFXDcUIph8f9s+rPs/flAAKxs6EY
         PRNljga7miQ0zNuHaYlOy4D7WzVeeq33fTlyFE+bB6HiO5CizS0tH1q84OTvGrIYZEja
         JEs1x/K/MFVkf5w58AFnMGA95BVk8NvQzmQKzQNLeFTlngqr3Wq4Omzfv3IdYe3gJx2y
         xX2+wukKSPZbvjceQtUVEBsA62TIWkgyCPCN0N2YuyLUs5nJcBI3lffyK8LyrDNUe88L
         N0ug==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=X1qP0yIzBq54g87TwIKpV0SZXbQXFiYLM+euJ7imfDI=;
        b=iDgwuk1c4fAwF9tHFf1Q73XGunIfpy2J0t3lVgEnvCY/2rAdBmNYp2HqR5rDuhxK/W
         09rmcxhixTpFrDZ7Kpc9+VbnsVdCsF4DBMhqXRbPDassx+8xl1Xtcmg6xbQngEcpi5SA
         Hm6zGTWByjSOZtK1blOXzgQgkqxI/6Qy/8OMjEDXuwG1IDEBh/99qzEALVbwSiKnEgTl
         39Qq4StY/r+WN4EStb0PktJCPlXb1lV4osoX4I1eQdSCd0RFEBlfr9CzcAvKTBBrmLDQ
         AqX5AmSR7YhcWy9LzduVAC1IO7TrJ1/Rxgk0QK3of7vaJc210orzw4dyD0C9NliYJNaD
         0g1w==
X-Gm-Message-State: AOAM532jkmA0edYVSxZBp9T/3LLWrFwgkscH6w4YlxqBqOHaJ3YC+N+G
        Xc0xI5sVIKL+zjvWvhaRQdmOmv/4
X-Google-Smtp-Source: 
 ABdhPJz86zEf39sMAZCXFBcoN9KWx5Gynq1s1fwiimZjjhFYamq+nzqqDckTcdV9PP4WxM4jWFM6Cw==
X-Received: by 2002:a1c:2bc2:: with SMTP id r185mr7954274wmr.49.1590742259324;
        Fri, 29 May 2020 01:50:59 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.50.58
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:50:58 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table
Date: Fri, 29 May 2020 10:50:06 +0200
Message-Id: <20200529085038.26008-3-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

The commit-graph file format specifies that the chunks may be in any
order.  However, if the OID Lookup chunk happens to be the last one in
the file, then any command attempting to access the commit-graph data
will fail with:

  fatal: invalid commit position. commit-graph is likely corrupt

In this case the error is wrong, the commit-graph file does conform to
the specification, but the parsing of the Chunk Lookup table is a bit
buggy, and leaves the field holding the number of commits in the
commit-graph zero-initialized.

The number of commits in the commit-graph is determined while parsing
the Chunk Lookup table, by dividing the size of the OID Lookup chunk
with the hash size.  However, the Chunk Lookup table doesn't actually
store the size of the chunks, but it stores their starting offset.
Consequently, the size of a chunk can only be calculated by
subtracting the starting offsets of that chunk from the offset of the
subsequent chunk, or in case of the last chunk from the offset
recorded in the terminating label.  This is currenly implemented in a
bit complicated way: as we iterate over the entries of the Chunk
Lookup table, we check the ID of each chunk and store its starting
offset, then we check the ID of the last seen chunk and calculate its
size using its previously saved offset if necessary (at the moment
it's only necessary for the OID Lookup chunk).  Alas, while parsing
the Chunk Lookup table we only interate through the "real" chunks, but
never look at the terminating label, thus don't even check whether
it's necessary to calulate the size of the last chunk.  Consequently,
if the OID Lookup chunk is the last one, then we don't calculate its
size and turn don't run the piece of code determining the number of
commits in the commit graph, leaving the field holding that number
unchanged (i.e. zero-initialized), eventually triggering the sanity
check in load_oid_from_graph().

Fix this by iterating through all entries in the Chunk Lookup table,
including the terminating label.

Note that this is the minimal fix, suitable for the maintenance track.
A better fix would be to simplify how the chunk sizes are calculated,
but that is a more invasive change, less suitable for 'maint', so that
will be done in later patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b205e65ed1..12df74c925 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -222,7 +222,7 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 	last_chunk_id = 0;
 	last_chunk_offset = 8;
 	chunk_lookup = data + 8;
-	for (i = 0; i < graph->num_chunks; i++) {
+	for (i = 0; i <= graph->num_chunks; i++) {
 		uint32_t chunk_id;
 		uint64_t chunk_offset;
 		int chunk_repeated = 0;

From patchwork Fri May 29 08:50:07 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578113
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8FA67159A
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:05 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 7730F207F5
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:05 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="RS7iQ/Hg"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726509AbgE2IvE (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:04 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44318 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1725865AbgE2IvC (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:02 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 55B60C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:02 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id x13so2570470wrv.4
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:02 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=2lJtXgjvmPX0bWg693Puho7kU2Einhh9C74g93LZTZE=;
        b=RS7iQ/HgtqkYuh6M8rGBmfjwcSCG5TsqR8jAw7lIBcs20oHt+6hCMuaMnVRrg/JKAo
         65TQrcbCvov55Qm+YGjM0YS3wvjHY8VWLO/KuJOtgTtdNaAfIPWg0wTv5a0YtzYx5pcH
         CdV/2K0E+nQMSBaz0AWiacOTC7EsBIJ2niTvGRsrwHPIqUx7OsMydznavSxiIT0/fDVm
         QxxuvXnNAXbC0dbfZiRsREkdkb0XKmG1QXJ4EZAW7FJCCBm4ukFXK29Ixr4fG0/UkqOv
         f8oOrHEjCcfSJhyVs5zTT0wrZXMVDDlfu6HSAd52nQeHxSl+hibCXNqKqCfduI/Im7eD
         rHkQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=2lJtXgjvmPX0bWg693Puho7kU2Einhh9C74g93LZTZE=;
        b=NJ1aUIHelXqZVzDC8Jumytj8K9/6v52PzrgyxOqWBO5Ui0tOwQOlHNax6v7115oTHT
         uLWFFBR+xdxXLtjJHzk52uUHPfqheixckqG86jMGXvMYQEe2poRQQe6vlplfmq2jBbza
         rSFzRcz+akKMX2Y/8gXl/Uc37G7jEb1nboIXKKD821GLnWvX4TDOGR9k+A6S7VfRUxGb
         xxKWpv77eiRT0HW/GnBxp/BZ0emNY5UvzZJIjDEuNUt4agLuqUCKyYdCfhzI35K8bT5i
         e/Q/YaNeZro6WH3eqfHjxSWQi2rrYtuKKzTU62JU1H/88fK3ArfdjtRKqqBinpjEpY40
         3Ldg==
X-Gm-Message-State: AOAM530N4TD3RPZLzfPnPIKWcs2Gp9Tt2jgmYQ+VbVXCeAml3O36y6vI
        PlbHabvSevtd8lBr9LF5FObE32hw
X-Google-Smtp-Source: 
 ABdhPJyvekspXUrv6ff0YtSThEIw1M1roPlAIv2LG1dBZZVisoaPf2RrpsIa48+Gk0zswwbTrrJyHw==
X-Received: by 2002:a5d:5389:: with SMTP id d9mr8223139wrv.77.1590742260723;
        Fri, 29 May 2020 01:51:00 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.50.59
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:00 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in
 network byte order
Date: Fri, 29 May 2020 10:50:07 +0200
Message-Id: <20200529085038.26008-4-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

The commit-graph format specifies that "All 4-byte numbers are in
network order", but the commit-graph contains 8-byte integers as well
(file offsets in the Chunk Lookup table), and their byte order is
unspecified.

Clarify that all multi-byte integers are in network byte order.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 Documentation/technical/commit-graph-format.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441ae..efd0f99d8b 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -29,7 +29,7 @@ the body into "chunks" and provide a binary lookup table at the beginning
 of the body. The header includes certain values, such as number of chunks
 and hash type.
 
-All 4-byte numbers are in network order.
+All multi-byte numbers are in network byte order.
 
 HEADER:
 

From patchwork Fri May 29 08:50:08 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578115
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A4890739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:10 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 8BC9820776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:10 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="mKyyOxuj"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726729AbgE2IvJ (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:09 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44322 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726415AbgE2IvF (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:05 -0400
Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com
 [IPv6:2a00:1450:4864:20::441])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D6EDAC03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:03 -0700 (PDT)
Received: by mail-wr1-x441.google.com with SMTP id l11so2623710wru.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:03 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=67gacoTzEwJUoqBISrjk2gqL97djVGYmY6G3JJpyUIs=;
        b=mKyyOxujEpcv73tRmXiPuVcpGxPL69ZUaKSO5ELtwYoQe66w95NW/OBPX7wRg7RaNb
         wI/q2x6HYEwIbBQ+zg9Yp0/tfWr3suSg9fvXi8TV/iQ5yrN+gNvQ8Ixh8wu3M7NVWv81
         7bGHm2Zr1D0g38KVgwHoj9Hw0YRya3IfZ4fpjb7bPcTZ2qXZN50G+COf50D18Elmr9c4
         3cSMiHoKxmhnCsF41f4ZS42Tch4092MXKrKTyb/59my5L7JSaGuS/obb9lEe0qfeIPyT
         Vm0HtI9Q92LyEw1/e2SroNFOquHGyaiR4wFgRvqFQ9PTFF0SgpgLf7VDITq3hgXnjxvO
         HFbQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=67gacoTzEwJUoqBISrjk2gqL97djVGYmY6G3JJpyUIs=;
        b=p8Cbj6OENlawa7dJw2xav9UNLR9GKMLCt1WG5uJUllY0KReeglBrO650XY53dNQojd
         gcQEtVZfDmgpX82KAkc4L5jYK14sQxhCaRXSvQICdLcryX815MY7LY+wWz03RYub5z3s
         7elzKehbO7Mhw5wiZrecrZL6uGO4zX23kLa+8TmORyeVVEejSpMvypHCwtpXUSt/wVaC
         3BCkEpBJBKUlJxQZjX25RTvctan27kxODq7u2/yj2QTAf1V2lY1UIrGxu8vDb+ceDcKP
         Sy5PsoIvQOjqZQoFvBYczgfY5DZUk1xZTxBOq+OLfwz+uNKMexLCW7jpQc87mC1ktvC6
         HZXw==
X-Gm-Message-State: AOAM533mCuY/yNGRevJBmyOXChoaWm05oS9DzSzPoTQON4DSG9AKVjMk
        xHXf4PGED8ciyK2KHtuPztJhCOUd
X-Google-Smtp-Source: 
 ABdhPJyEPcF3t8u2zplzWL0VcoD4KlIz2pm5x3jSJ4wHGMmHIEF25nPIEhVmHMJvOhZMWILzoKrJ/g==
X-Received: by 2002:a5d:518b:: with SMTP id k11mr8310208wrv.58.1590742262101;
        Fri, 29 May 2020 01:51:02 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.01
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:01 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 04/34] commit-slab: add a function to deep free entries on the
 slab
Date: Fri, 29 May 2020 10:50:08 +0200
Message-Id: <20200529085038.26008-5-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

clear_##slabname() frees only the memory allocated for a commit slab
itself, but entries in the commit slab might own additional memory
outside the slab that should be freed as well.  We already have (at
least) one such commit slab, and this patch series is about to add one
more.

To free all additional memory owned by entries on the commit slab the
user of such a slab could iterate over all commits it knows about,
peek whether there is a valid entry associated with each commit, and
free the additional memory, if any.  Or it could rely on intimate
knowledge about the internals of the commit slab implementation, and
could itself iterate directly through all entries in the slab, and
free the additional memory.  Or it could just leak the additional
memory...

Introduce deep_clear_##slabname() to allow releasing memory owned by
commit slab entries by invoking the 'void free_fn(elemtype *ptr)'
function specified as parameter for each entry in the slab.

Use it in get_shallow_commits() in 'shallow.c' to replace an
open-coded iteration over a commit slab's entries.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-slab-decl.h |  1 +
 commit-slab-impl.h | 13 +++++++++++++
 commit-slab.h      | 10 ++++++++++
 shallow.c          | 14 +++++---------
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/commit-slab-decl.h b/commit-slab-decl.h
index adc7b46c83..286164b7e2 100644
--- a/commit-slab-decl.h
+++ b/commit-slab-decl.h
@@ -32,6 +32,7 @@ struct slabname {							\
 void init_ ##slabname## _with_stride(struct slabname *s, unsigned stride); \
 void init_ ##slabname(struct slabname *s);				\
 void clear_ ##slabname(struct slabname *s);				\
+void deep_clear_ ##slabname(struct slabname *s, void (*free_fn)(elemtype *ptr)); \
 elemtype *slabname## _at_peek(struct slabname *s, const struct commit *c, int add_if_missing); \
 elemtype *slabname## _at(struct slabname *s, const struct commit *c);	\
 elemtype *slabname## _peek(struct slabname *s, const struct commit *c)
diff --git a/commit-slab-impl.h b/commit-slab-impl.h
index 5c0eb91a5d..557738df27 100644
--- a/commit-slab-impl.h
+++ b/commit-slab-impl.h
@@ -38,6 +38,19 @@ scope void clear_ ##slabname(struct slabname *s)			\
 	FREE_AND_NULL(s->slab);						\
 }									\
 									\
+scope void deep_clear_ ##slabname(struct slabname *s, void (*free_fn)(elemtype *)) \
+{									\
+	unsigned int i;							\
+	for (i = 0; i < s->slab_count; i++) {				\
+		unsigned int j;						\
+		if (!s->slab[i])					\
+			continue;					\
+		for (j = 0; j < s->slab_size; j++)			\
+			free_fn(&s->slab[i][j * s->stride]);		\
+	}								\
+	clear_ ##slabname(s);						\
+}									\
+									\
 scope elemtype *slabname## _at_peek(struct slabname *s,			\
 						  const struct commit *c, \
 						  int add_if_missing)   \
diff --git a/commit-slab.h b/commit-slab.h
index 69bf0c807c..7bd472171e 100644
--- a/commit-slab.h
+++ b/commit-slab.h
@@ -42,6 +42,16 @@
  *
  *   Call this function before the slab falls out of scope to avoid
  *   leaking memory.
+ *
+ * - void deep_clear_indegree(struct indegree *, void (*free_fn)(int*))
+ *
+ *   Empties the slab, similar to clear_indegree(), but in addition it
+ *   calls the given 'free_fn' for each slab entry to release any
+ *   additional memory that might be owned by the entry (but not the
+ *   entry itself!).
+ *   Note that 'free_fn' might be called even for entries for which no
+ *   indegree_at() call has been made; in this case 'free_fn' is invoked
+ *   with a pointer to a zero-initialized location.
  */
 
 #define define_commit_slab(slabname, elemtype) \
diff --git a/shallow.c b/shallow.c
index 7fd04afed1..d02ba494ee 100644
--- a/shallow.c
+++ b/shallow.c
@@ -84,6 +84,10 @@ int is_repository_shallow(struct repository *r)
  * supports a "valid" flag.
  */
 define_commit_slab(commit_depth, int *);
+void free_depth_in_slab(int **ptr)
+{
+	FREE_AND_NULL(*ptr);
+}
 struct commit_list *get_shallow_commits(struct object_array *heads, int depth,
 		int shallow_flag, int not_shallow_flag)
 {
@@ -150,15 +154,7 @@ struct commit_list *get_shallow_commits(struct object_array *heads, int depth,
 			}
 		}
 	}
-	for (i = 0; i < depths.slab_count; i++) {
-		int j;
-
-		if (!depths.slab[i])
-			continue;
-		for (j = 0; j < depths.slab_size; j++)
-			free(depths.slab[i][j]);
-	}
-	clear_commit_depth(&depths);
+	deep_clear_commit_depth(&depths, free_depth_in_slab);
 
 	return result;
 }

From patchwork Fri May 29 08:50:09 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578117
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6ABF2739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:13 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 4DE1220776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:13 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="D/s3jJkU"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726771AbgE2IvK (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:10 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44330 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726579AbgE2IvF (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:05 -0400
Received: from mail-wm1-x343.google.com (mail-wm1-x343.google.com
 [IPv6:2a00:1450:4864:20::343])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 86936C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:05 -0700 (PDT)
Received: by mail-wm1-x343.google.com with SMTP id r15so2413139wmh.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:05 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=9feaE5i968rtyL1eHI9KIXlmeHS1tXADrHmqrjGlOZg=;
        b=D/s3jJkUSEcxc8O1HyXsTWSrutJrYjENw9C+ljo1ju+SL5mVHURTnNWR7JyFNKIMB3
         kfftHn7LnPdQ7e+vFwBf2K/408YtfmWiT1U1k/OTHxK+/KIuMT1nqVQQDvKNH68LTOU0
         rEzzHSC+2n+klQyUMYuAPrzg7zUNpyLe6cSIlIIsM40GeMFdcrFJkgCUQ3/Mk7dbPayg
         LHL2CqkdMNBupOq+hFuB7gB53WItcMCtve+l9jhHuVeKnAdOIq8T5PefPR3DwvArs1hp
         dZyzg6qkiGzjmUrZ2zllA/h+G9DrHhPibeV0xoN9FRFgvt3lIpJ06DwomLHWTTUFbLpp
         3R7Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=9feaE5i968rtyL1eHI9KIXlmeHS1tXADrHmqrjGlOZg=;
        b=tDlF1LwlCEuXGgV45zGRKOtTouJxvtb8oliGjC6Af90zAX+sok/zswT51BECFbu68c
         j0OKdxDqe3qccMuoDbzAjc1LTLCE0UJ+xJaKNmdIy7lyygjqpxijybOQAMllb+gOM558
         m6rQc0Q09sVDy8u5HhoawSiKlTtBGHvotD5XHESfSL4OwM7li3peDnGotzs+lrcgdefk
         KnhH+nCLM6aYlhu5AEpy5gWwNMecTTSg9jWAwAjcSKd+8uLQ1rAQpd1BSjYsEXkyvpEa
         LZwInNZD3Oyzc1VBW/scGo5EtFBG5N5yr7KU1EhpVgdSrvJXYGCW4MdwVh4JWy6Q7sGJ
         zInQ==
X-Gm-Message-State: AOAM533yA9RUo5UOkKKl7w5zVlHal2AWvUX1BfSk0XM0w4Qh5sqBz/ZE
        RWo/WoSaMGAXxPNsxY9p7Sqveiyd
X-Google-Smtp-Source: 
 ABdhPJzMwBW9RdJ4Q8fGNviLRNDgPbhlz+rGiWlQfIwSO+KWQBDABZBx6e1bZY2jjETBgpklpNGImw==
X-Received: by 2002:a1c:4b0f:: with SMTP id y15mr2270196wma.83.1590742263858;
        Fri, 29 May 2020 01:51:03 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.02
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:03 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value
Date: Fri, 29 May 2020 10:50:09 +0200
Message-Id: <20200529085038.26008-6-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

ll_diff_tree_oid() has only ever returned 0 [1], so it's return value
is basically useless.  It's only caller diff_tree_oid() has only ever
returned the return value of ll_diff_tree_oid() as-is [2], so its
return value is just as useless.  Most of diff_tree_oid()'s callers
simply ignore its return value, except:

  - diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
    returns with its return value, but all of diff_root_tree_oid()'s
    callers ignore its return value.

  - rev_compare_tree() and rev_same_tree_as_empty() do look at the
    return value in a condition, but, since the return value is always
    0, the former's < 0 condition is never fulfilled, while the
    latter's >= 0 condition is always fulfilled.

So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid()
and diff_root_tree_oid(), and drop those conditions from
rev_compare_tree() and rev_same_tree_as_empty() as well.

[1] ll_diff_tree_oid() and its ancestors have been returning only 0
    ever since it was introduced as diff_tree() in 9174026cfe (Add
    "diff-tree" program to show which files have changed between two
    trees., 2005-04-09).
[2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026cfe as
    well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 diff.h      | 10 +++++-----
 revision.c  |  9 +++------
 tree-diff.c | 21 +++++++++------------
 3 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/diff.h b/diff.h
index 6febe7e365..a1459de824 100644
--- a/diff.h
+++ b/diff.h
@@ -426,11 +426,11 @@ struct combine_diff_path *diff_tree_paths(
 	struct combine_diff_path *p, const struct object_id *oid,
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt);
-int diff_tree_oid(const struct object_id *old_oid,
-		  const struct object_id *new_oid,
-		  const char *base, struct diff_options *opt);
-int diff_root_tree_oid(const struct object_id *new_oid, const char *base,
-		       struct diff_options *opt);
+void diff_tree_oid(const struct object_id *old_oid,
+		   const struct object_id *new_oid,
+		   const char *base, struct diff_options *opt);
+void diff_root_tree_oid(const struct object_id *new_oid, const char *base,
+			struct diff_options *opt);
 
 struct combine_diff_path {
 	struct combine_diff_path *next;
diff --git a/revision.c b/revision.c
index 8136929e23..9dac1262ef 100644
--- a/revision.c
+++ b/revision.c
@@ -655,15 +655,12 @@ static int rev_compare_tree(struct rev_info *revs,
 
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
-	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
-			   &revs->pruning) < 0)
-		return REV_TREE_DIFFERENT;
+	diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning);
 	return tree_difference;
 }
 
 static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 {
-	int retval;
 	struct tree *t1 = get_commit_tree(commit);
 
 	if (!t1)
@@ -671,9 +668,9 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
-	retval = diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
+	diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
 
-	return retval >= 0 && (tree_difference == REV_TREE_SAME);
+	return tree_difference == REV_TREE_SAME;
 }
 
 struct treesame_state {
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3..9775e4bcaa 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -29,7 +29,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 	struct combine_diff_path *p, const struct object_id *oid,
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt);
-static int ll_diff_tree_oid(const struct object_id *old_oid,
+static void ll_diff_tree_oid(const struct object_id *old_oid,
 			    const struct object_id *new_oid,
 			    struct strbuf *base, struct diff_options *opt);
 
@@ -673,7 +673,7 @@ static void try_to_follow_renames(const struct object_id *old_oid,
 	q->nr = 1;
 }
 
-static int ll_diff_tree_oid(const struct object_id *old_oid,
+static void ll_diff_tree_oid(const struct object_id *old_oid,
 			    const struct object_id *new_oid,
 			    struct strbuf *base, struct diff_options *opt)
 {
@@ -691,29 +691,26 @@ static int ll_diff_tree_oid(const struct object_id *old_oid,
 	}
 
 	opt->pathchange = pathchange_old;
-	return 0;
 }
 
-int diff_tree_oid(const struct object_id *old_oid,
-		  const struct object_id *new_oid,
-		  const char *base_str, struct diff_options *opt)
+void diff_tree_oid(const struct object_id *old_oid,
+		   const struct object_id *new_oid,
+		   const char *base_str, struct diff_options *opt)
 {
 	struct strbuf base;
-	int retval;
 
 	strbuf_init(&base, PATH_MAX);
 	strbuf_addstr(&base, base_str);
 
-	retval = ll_diff_tree_oid(old_oid, new_oid, &base, opt);
+	ll_diff_tree_oid(old_oid, new_oid, &base, opt);
 	if (!*base_str && opt->flags.follow_renames && diff_might_be_rename())
 		try_to_follow_renames(old_oid, new_oid, &base, opt);
 
 	strbuf_release(&base);
-
-	return retval;
 }
 
-int diff_root_tree_oid(const struct object_id *new_oid, const char *base, struct diff_options *opt)
+void diff_root_tree_oid(const struct object_id *new_oid, const char *base,
+			struct diff_options *opt)
 {
-	return diff_tree_oid(NULL, new_oid, base, opt);
+	diff_tree_oid(NULL, new_oid, base, opt);
 }

From patchwork Fri May 29 08:50:10 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578121
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E9CE2159A
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:14 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id D14FD20776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:14 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="eJeqO5FP"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726812AbgE2IvN (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:13 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44334 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726593AbgE2IvH (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:07 -0400
Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com
 [IPv6:2a00:1450:4864:20::441])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 05B51C08C5C8
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:07 -0700 (PDT)
Received: by mail-wr1-x441.google.com with SMTP id c3so2486757wru.12
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=xKOEwdMcEgkOfkRknqVUn547aOdqsfbOFi4pIQH+8ng=;
        b=eJeqO5FPGkr/X8z31/o1udXxlqB657ZUzdropdRrlzmBPPIpTtXPNvFUlhtIJ0SuJv
         5Mi2L4IgQXIbLjlPgwmYxYWr1WVUpleYs+6RNr/P6odQMyTxxWGMLQO2CMNO73FdYmT5
         4NFGSag11yOQ9aBMJqQ1tGvLfTIsrVR8iapdd28bGdewoK2KzwdK50TUUtK97fxznVM4
         KFy41cHMSesHN+aq8HyDTCTbHXWKF0BH0NHgzHY3bTXRNojZf0bTmVSCdZfcegHAajAE
         ItGqjcWvHV42+PAZjNwIsDAjD20sSb5Wim+/ZUsb2qQJ9BYuUIbuvFDKAKJ1oW+ykBw/
         HXtQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=xKOEwdMcEgkOfkRknqVUn547aOdqsfbOFi4pIQH+8ng=;
        b=gQKj8UqPtBC9qsqmj0TvtCvUFWScpu6tNvqXkxMs8AGcw5cBvC0fENXwC37Yc7LzVx
         3ROI7x6/9y85HXRcuUgvXNh0WNZzes/UcbG1B0Gx5O1+AwMyaQ258fQy53IP7ibtX69a
         5+6bZHgwyjmngcL5JaG7G8j955yW4oKrvugHk8VDxAx75SejLJ0SxNaGFAgM2MsPrZEB
         NlYPTkIYBVSykD25lwUS0oS9RGtL3Ml0GH20HhwcLu3/+xkCHMqNAd+5emfDOK6MNM/A
         AO9I95H4GPnkvGLkzg13stz67rVbO6vtj59iieEiynaVXlVRQYWsiZODHgFn9VhYbP5p
         23Pw==
X-Gm-Message-State: AOAM532e/t2WC9s5Ryd/OiaN9UBbx21mvCkOkKGRho09p0myQ4XMR9sp
        LESZSks9FV+AkTI90+asjvFlTqmo
X-Google-Smtp-Source: 
 ABdhPJyAHe5tGb8mHfbFP2sPt9U32RwYH9ENd4IfkPoopun/c2jfhhh2efi0nTa9cTri6lJktEo4YQ==
X-Received: by 2002:a5d:6550:: with SMTP id z16mr8209659wrv.392.1590742265470;
        Fri, 29 May 2020 01:51:05 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.04
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:04 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 06/34] commit-graph: clean up #includes
Date: Fri, 29 May 2020 10:50:10 +0200
Message-Id: <20200529085038.26008-7-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Our CodingGuidelines says that it's sufficient to include one of
'git-compat-util.h' and 'cache.h', but both 'commit-graph.c' and
'commit-graph.h' include both.  Let's include only 'git-compat-util.h'
to loose a bunch of unnecessary dependencies; but include 'hash.h',
because 'commit-graph.h' does require the definition of 'struct
object_id'.

'commit-graph.h' explicitly includes 'repository.h' and
'string-list.h', but only needs the declaration of a few structs from
them.  Drop these includes and forward-declare the necessary structs
instead.

'commit-graph.c' includes 'dir.h', but doesn't actually use anything
from there, so let's drop that #include as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 4 +---
 commit-graph.h | 7 ++++---
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 12df74c925..f5ba8a9578 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1,7 +1,5 @@
-#include "cache.h"
-#include "config.h"
-#include "dir.h"
 #include "git-compat-util.h"
+#include "config.h"
 #include "lockfile.h"
 #include "pack.h"
 #include "packfile.h"
diff --git a/commit-graph.h b/commit-graph.h
index 7f5c933fa2..847cd25cfc 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -2,14 +2,15 @@
 #define COMMIT_GRAPH_H
 
 #include "git-compat-util.h"
-#include "repository.h"
-#include "string-list.h"
-#include "cache.h"
+#include "hash.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
 
 struct commit;
+struct repository;
+struct raw_object_store;
+struct string_list;
 
 char *get_commit_graph_filename(const char *obj_dir);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);

From patchwork Fri May 29 08:50:11 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578119
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 98337739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:14 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 7EFD420776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:14 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="jaJRg7bu"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726827AbgE2IvN (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:13 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44342 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726518AbgE2IvI (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:08 -0400
Received: from mail-wr1-x42d.google.com (mail-wr1-x42d.google.com
 [IPv6:2a00:1450:4864:20::42d])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9AEACC08C5C9
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:08 -0700 (PDT)
Received: by mail-wr1-x42d.google.com with SMTP id l11so2624480wru.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:08 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=o8xepLZo0NYKdTng3WvBSyk5T9wPMicb/k259/cHwj0=;
        b=jaJRg7buOkHVR5s9RA0PW6fCc3M4WqJj+dVq2uSdZnURttFzvHI+4THxr96CHckbLS
         hNrfCazuCALOdKKUcJSh9uFV3DLvOfL4+M51gVZjuydQgRp4HKGw/nLyKd0jne9a8tJw
         dKRi5DsmhrtuFnZSOvoa1Ru+VBQMOkl1AG7H0IUTBi60K2kQy+6iTCCW1JwIoi7v7K8l
         XKAFUXWfZ8uasealGmsb7JUfGv3EWxy4SYBkF60u/mu9LlWy89Yr5H2elY9wPsx/XZv0
         vSVvmzYsgB+1hoK5MgVeMpb+1knhSxyzIdvsrvqzY8dB+9Ba8Ukl6/2HKW0A+BWdgS7k
         0D9w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=o8xepLZo0NYKdTng3WvBSyk5T9wPMicb/k259/cHwj0=;
        b=dbDByqNGJjSubBpd+wBlTvnN4jtr/10dQLTDB6hvhjvPfHbx8iaxatRQeK64PpD/Ku
         ZK797taLup/oRZOj3o2bsad6MUsmmAqkL2K1CtLJ/R2vTPQN37n5HoP6g4NHcQuDlTXs
         X3JanmLO+zhBbMsAZC+Q/+yHJwrRT3EghU0QkTxeLxZGiJbcz9kK5Fs5mId/UmNJjVKB
         5lHJsYP9bXQfYB+CL+yYGj06YOBnQBTO2k9Q4I6aqKym3c5/5H+4T8+1dE0sgDJgX2A8
         mBt1sn1iWVgaIJMsH9q89Qnh70p0sV9g5WwQqnEZucvqK5X9ZIIffMHeQlgJOYAXiAzp
         K+Jw==
X-Gm-Message-State: AOAM530yFGbs8gPJ3YtbVLwlIS+U4cRPTZnuvG59ZhCy7P88sth6WD8I
        z768n/bCRdb1l81SHleVeEsf1/ep
X-Google-Smtp-Source: 
 ABdhPJwo29LvCTWEATIF/4rQDoc9llgurlv7v3RaO/VfzkWdzSK7pk2UjfucM2B31BSeZGK5RXsUng==
X-Received: by 2002:adf:aa42:: with SMTP id q2mr4921190wrd.360.1590742266987;
        Fri, 29 May 2020 01:51:06 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.05
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:06 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1
Date: Fri, 29 May 2020 10:50:11 +0200
Message-Id: <20200529085038.26008-8-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

While we iterate over all entries of the Chunk Lookup table we make
sure that we don't attempt to read past the end of the mmap-ed
commit-graph file, and check in each iteration that the chunk ID and
offset we are about to read is still within the mmap-ed memory region.
However, these checks in each iteration are not really necessary,
because the number of chunks in the commit-graph file is already known
before this loop from the just parsed commit-graph header.

So let's check that the commit-graph file is large enough for all
entries in the Chunk Lookup table before we start iterating over those
entries, and drop those per-iteration checks.  While at it, take into
account the size of everything that is necessary to have a valid
commit-graph file, i.e. the size of the header, the size of the
mandatory OID Fanout chunk, and the size of the signature in the
trailer as well.

Note that this necessitates the change of the error message as well,
and, consequently, have to update the 'detect incorrect chunk count'
test in 't5318-commit-graph.sh' as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c          | 16 +++++++++-------
 t/t5318-commit-graph.sh |  3 ++-
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f5ba8a9578..69facf3bf5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -217,6 +217,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 	graph->data = graph_map;
 	graph->data_len = graph_size;
 
+	if (graph_size < GRAPH_HEADER_SIZE +
+			 (graph->num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH +
+			 GRAPH_FANOUT_SIZE + the_hash_algo->rawsz) {
+		error(_("commit-graph file is too small to hold %u chunks"),
+		      graph->num_chunks);
+		free(graph);
+		return NULL;
+	}
+
 	last_chunk_id = 0;
 	last_chunk_offset = 8;
 	chunk_lookup = data + 8;
@@ -225,13 +234,6 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		uint64_t chunk_offset;
 		int chunk_repeated = 0;
 
-		if (data + graph_size - chunk_lookup <
-		    GRAPH_CHUNKLOOKUP_WIDTH) {
-			error(_("commit-graph chunk lookup table entry missing; file may be incomplete"));
-			free(graph);
-			return NULL;
-		}
-
 		chunk_id = get_be32(chunk_lookup + 0);
 		chunk_offset = get_be64(chunk_lookup + 4);
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3f03de6018..b9a5101c09 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -572,7 +572,8 @@ test_expect_success 'detect invalid checksum hash' '
 
 test_expect_success 'detect incorrect chunk count' '
 	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\377" \
-		"chunk lookup table entry missing" $GRAPH_CHUNK_LOOKUP_OFFSET
+		"commit-graph file is too small to hold [0-9]* chunks" \
+		$GRAPH_CHUNK_LOOKUP_OFFSET
 '
 
 test_expect_success 'git fsck (checks commit-graph)' '

From patchwork Fri May 29 08:50:12 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578127
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A49E2739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:19 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 8A5B8207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:19 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="niADDxFK"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726861AbgE2IvP (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:15 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44346 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726415AbgE2IvK (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:10 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DAD89C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:09 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id l11so2624698wru.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=7ze0baRJzbWZV+UqBej9xtjj9xdJaKXx+x3lfNTIFBw=;
        b=niADDxFKFjxE2e0LB2HXTB7N+3a49MRfnJ+qANsTflXCGdXuUVh6kNLUVwrSMZuwEC
         xzyA2/ofhEwBFTifKO18Ps5t3rsuT5z18quZMl/Ud9umPfEVVJUHxhRdG87fEqoVBbv3
         EIDrK7p7ct6P2ZLq3UHc/kLpUjukMwTNda051gSRDKjr0PIRVoji9fA3Id4yx1he1qc4
         mPpjk/cJIB6znE/8Eu15Ff0ayq3+dnB5Tm81wKfbDpMxoiNM3Dar0qgWDUb1J2PehT3v
         xbdApxvCTuRMPw1P0fJ+vVjRYIrEiUj6UMr4bjPJZlheNQNxl5ZP1sR1tN4wNd/3+Mrg
         iB4Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=7ze0baRJzbWZV+UqBej9xtjj9xdJaKXx+x3lfNTIFBw=;
        b=MrmoKV8Rqembl8TBi4/MH0ucP4IIcJbEmq5SywotRUzHholI0PMZA5fPAPiY/VYquB
         6glQD8EkYAxIHrTLWxM17G7GumD5Sqgi4HtWDbzLAEQhbuSNUyjGazskrerNQbMniRVN
         Lpon//GzReOwIPml7EOTQ362hlNtfhtcEt0UbskV5lPw+yXOrGrtfKgAcu2JhPo6R5SK
         vZGS9ZtAp6DS29xRD+izmulOqKh8QSrAFjP/6BV0zP4zRnVMqdPG0iLL2MCyJeyEsMh8
         SjyRr2v/6Bsttr2x+z0blCjFPRjHRKmwejh3pX9Rcq0NihpXf1oyjGFD6W89yWzdTfJy
         epXA==
X-Gm-Message-State: AOAM5326BQb/75cohGPmUqRpeJpwu3vIH47XSBdB+YZrsXvcpXNLGyw+
        7DLZ0xTbz3Tlv9J4VznSzpazEKav
X-Google-Smtp-Source: 
 ABdhPJxVPPRiuWC8FipdQzubfPk4uxwtkonF+6UrnFAJpKD72WYIB4EWvwmR09X2bf/eL0PA/u1oBA==
X-Received: by 2002:a5d:4d0d:: with SMTP id z13mr8177904wrt.220.1590742268336;
        Fri, 29 May 2020 01:51:08 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.07
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:07 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2
Date: Fri, 29 May 2020 10:50:12 +0200
Message-Id: <20200529085038.26008-9-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

The Chunk Lookup table stores the chunks' starting offset in the
commit-graph file, not their sizes.  Consequently, the size of a chunk
can only be calculated by subtracting its offset from the offset of
the subsequent chunk (or that of the terminating label).  This is
currenly implemented in a bit complicated way: as we iterate over the
entries of the Chunk Lookup table, we check the id of each chunk and
store its starting offset, then we check the id of the last seen chunk
and calculate its size using its previously saved offset.  At the
moment there is only one chunk for which we calculate its size, but
this patch series will add more, and the repeated chunk id checks are
not that pretty.

Instead let's read ahead the offset of the next chunk on each
iteration, so we can calculate the size of each chunk right away,
right where we store its starting offset.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 69facf3bf5..f64b1c01a8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -175,8 +175,7 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
 	struct commit_graph *graph;
-	uint64_t last_chunk_offset;
-	uint32_t last_chunk_id;
+	uint64_t next_chunk_offset;
 	uint32_t graph_signature;
 	unsigned char graph_version, hash_version;
 
@@ -226,16 +225,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		return NULL;
 	}
 
-	last_chunk_id = 0;
-	last_chunk_offset = 8;
 	chunk_lookup = data + 8;
-	for (i = 0; i <= graph->num_chunks; i++) {
+	next_chunk_offset = get_be64(chunk_lookup + 4);
+	for (i = 0; i < graph->num_chunks; i++) {
 		uint32_t chunk_id;
 		uint64_t chunk_offset;
 		int chunk_repeated = 0;
 
 		chunk_id = get_be32(chunk_lookup + 0);
-		chunk_offset = get_be64(chunk_lookup + 4);
+		chunk_offset = next_chunk_offset;
+		next_chunk_offset = get_be64(chunk_lookup + 4 +
+					     GRAPH_CHUNKLOOKUP_WIDTH);
 
 		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
 
@@ -257,8 +257,11 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		case GRAPH_CHUNKID_OIDLOOKUP:
 			if (graph->chunk_oid_lookup)
 				chunk_repeated = 1;
-			else
+			else {
 				graph->chunk_oid_lookup = data + chunk_offset;
+				graph->num_commits = (next_chunk_offset - chunk_offset)
+						     / graph->hash_len;
+			}
 			break;
 
 		case GRAPH_CHUNKID_DATA:
@@ -287,15 +290,6 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			free(graph);
 			return NULL;
 		}
-
-		if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
-		{
-			graph->num_commits = (chunk_offset - last_chunk_offset)
-					     / graph->hash_len;
-		}
-
-		last_chunk_id = chunk_id;
-		last_chunk_offset = chunk_offset;
 	}
 
 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);

From patchwork Fri May 29 08:50:13 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578123
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 61284912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:17 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 4901220776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:17 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="NrbSnns7"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726865AbgE2IvQ (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:16 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44356 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726803AbgE2IvN (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:13 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EA898C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:11 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id y17so2501483wrn.11
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=Aa09xbF1s6CPu0fooG6l3ZNN6OOcSLl+f/MT/1t9fzo=;
        b=NrbSnns7XGCd3f5VGLNccKufytHLcg8FcQ/pT/3wlx0cAjEAUi0MH4+HAJ0/ZAZL4n
         kWLhClzlTGNHsIyHH5z5bzbg2KOFghVmgYfxqJ1myyVUX3bB21Y3JCWKAPOmJz2BFdUj
         UtSQJ5ZRpm0mC7mZ0mOv/p9tl2LP5/EiTHhB56R12iqOJyYL5xKCQ5zwlRa9LJRgtwqR
         f0N03b7UZVcp+ZO4Ig3nbcz4vZ3+Tt1uXb5lXAormV9TOh2hwZH4j2qGAO1RayGIjsu6
         P0zAcc9I9um/DvgbyFeBkK9tM5oj+SGPqIs6SzC9iYDznsOKIzPJWJUhV0K1ThZ+rAxM
         9vCg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=Aa09xbF1s6CPu0fooG6l3ZNN6OOcSLl+f/MT/1t9fzo=;
        b=YGB07IGP/u4cTAu3i1vZv/M/HiNxzBQTNu8NIamlQxRaukh9/SqDaYyRj7wD2rk1xu
         c190lBnjf1HPTO5UIe6jCXaKRIpN+5kJ7wLs1FabpmtIsEKO68r9woRklBySJdiryRgH
         b8Nx6xdSIGIbUamMTcHd0knShY0x1zZfzy3VjIARZLDjGiXJ8ht9F+y+6UR/YXLHH0CG
         e5OJsVdZRO1m9WHaEqQZ5XlSeTAaNs+nczq9jm/W9vNvu4ejLDvMfaUzJckuNBV5PecP
         iSs1PsWenacCkEmues+MN7ijzAVVxVQ9nyBc7yz6BC/3WX3/zH7yoDqnYL0b+CMYCD7L
         baHw==
X-Gm-Message-State: AOAM530zqxxQCu3JYWdIiR+tEMrw7SxHUyXZCYa5fO/wS7SUtOvqOM3h
        7loY+WQDwiCypG3luQqWtovN46IV
X-Google-Smtp-Source: 
 ABdhPJyhV68CbBUUYa7qMwo3P13VLjFkkBY0ubc6QvT3cJsi4JqgQr2aw1Ij2+K8E25W0nXnQgN2NQ==
X-Received: by 2002:adf:f882:: with SMTP id u2mr8164739wrp.40.1590742269814;
        Fri, 29 May 2020 01:51:09 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.08
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:09 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1
Date: Fri, 29 May 2020 10:50:13 +0200
Message-Id: <20200529085038.26008-10-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

In write_commit_graph_file() one block of code fills the array of
chunk IDs, another block of code fills the array of chunk offsets,
then the chunk IDs and offsets are written to the Chunk Lookup table,
and finally a third block of code writes the actual chunks.  In case
of optional chunks like Extra Edge List and Base Graphs List there is
also a condition checking whether that chunk is necessary/desired, and
that same condition is repeated in all those three blocks of code.
This patch series is about to add more optional chunks, so there would
be even more repeated conditions.

Those chunk offsets are relative to the beginning of the file, so they
inherently depend on the size of the Chunk Lookup table, which in turn
depends on the number of chunks that are to be written to the
commit-graph file.  IOW at the time we set the first chunk's ID we
can't yet know its offset, because we don't yet know how many chunks
there are.

Simplify this by initially filling an array of chunk sizes, not
offsets, and calculate the offsets based on the chunk sizes only
later, while we are writing the Chunk Lookup table.  This way we can
fill the arrays of chunk IDs and sizes in one go, eliminating one set
of repeated conditions.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f64b1c01a8..416f4b2468 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1351,10 +1351,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
 	uint32_t chunk_ids[6];
-	uint64_t chunk_offsets[6];
+	uint64_t chunk_sizes[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
+	uint64_t chunk_offset;
 	struct object_id file_hash;
 
 	if (ctx->split) {
@@ -1394,35 +1395,24 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	}
 
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
+	chunk_sizes[0] = GRAPH_FANOUT_SIZE;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
+	chunk_sizes[1] = hashsz * ctx->commits.nr;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	chunk_sizes[2] = (hashsz + 16) * ctx->commits.nr;
 	if (ctx->num_extra_edges) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
+		chunk_sizes[3] = 4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
+		chunk_sizes[4] = hashsz * (ctx->num_commit_graphs_after - 1);
 		num_chunks++;
 	}
 
 	chunk_ids[num_chunks] = 0;
-
-	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
-	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
-	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
-
-	num_chunks = 3;
-	if (ctx->num_extra_edges) {
-		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
-						4 * ctx->num_extra_edges;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
-						hashsz * (ctx->num_commit_graphs_after - 1);
-		num_chunks++;
-	}
+	chunk_sizes[num_chunks] = 0;
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
@@ -1431,13 +1421,16 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
+	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
 
 		chunk_write[0] = htonl(chunk_ids[i]);
-		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
-		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
+		chunk_write[1] = htonl(chunk_offset >> 32);
+		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
 		hashwrite(f, chunk_write, 12);
+
+		chunk_offset += chunk_sizes[i];
 	}
 
 	if (ctx->report_progress) {

From patchwork Fri May 29 08:50:14 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578125
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 87B4C912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:18 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 6FECE20776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:18 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="nUUcHuct"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726877AbgE2IvQ (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:16 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44358 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726451AbgE2IvN (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:13 -0400
Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com
 [IPv6:2a00:1450:4864:20::441])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F031EC08C5C8
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:12 -0700 (PDT)
Received: by mail-wr1-x441.google.com with SMTP id l11so2625210wru.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=ZSLeKvlclJG8zvTmUNsJgF/OSsfTGEM3yhZ9lBbV2kM=;
        b=nUUcHuctVCNXEQoauEaTVwgtUbYCqXOMLwtVLLUjsnxYwec/LLFUHnd7Qb5rGmnse2
         khcpglxy38mIE3wdJfyK3crn8EVv47wpA+O+fsEtxJd8FJTkzwTjnheUGEMCFtBPWOs/
         TJwLYAJngAG97RxF+JUVW5fkgw2wlUs+bzcNs+H6TuLDzMxx4dWUN6GMphu1B09Is3LM
         BNHvea7t86gUftDaL8vECtB4fx/VPT46QTfFXse9DCM3zaiywtsMmD4QpOBlKoXVUK38
         f4Qt390ahEI6nbh/eamSURGgxoMpAtx8lxlZPQB95KitUcaL6F6U2r/eSwLWuHQoEXkB
         v1NA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=ZSLeKvlclJG8zvTmUNsJgF/OSsfTGEM3yhZ9lBbV2kM=;
        b=cx3ky8l+dfiPFeq8PIhAGyNzgAQw8rfzvZxQr6OP4XFQnI0mTo1GLcYxaD+opz4zx/
         8LHaI3ZrR+/1lYPtdH6IiflnMB6ZB+1tDLewUQHoyE8ouuDXS3qd3QfppmSPqCMv0s3C
         Wq/SznXXifD+IBbO7oz1O0jkH0ggyUmL4jjtbSV9w3uPDEJ2MokywQpegeyy2l4AkBNY
         Nft/6JwCcc23lsPE5RfwbTmI/ILrp8K23aDSWk4qjxETGlSfzz4u2A2xWGeTjeSQkGRV
         hS/93XpPRYGQ5sAnrhSbsZVqTz9xbJ9CY5GLMamB1p7GinWNKI8NVhg8WCkNZOc8qJJk
         WZ2A==
X-Gm-Message-State: AOAM532ntN74V/bO/Hf/j8xYWdtScqUBJdAgukM9gtLRMdoEpAXGVLlQ
        ik8DLAzN0WSx3tA4nsV9HmKkW0nF
X-Google-Smtp-Source: 
 ABdhPJzWyr6ywqAGxZLyuPbaAUff/CzB3Z7AlVFCqjh2ZegAcKNzqDPj+Eee8TFb+KHwNe8CDz9jtg==
X-Received: by 2002:a5d:4d89:: with SMTP id b9mr8297197wru.210.1590742271436;
        Fri, 29 May 2020 01:51:11 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.09
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:10 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2
Date: Fri, 29 May 2020 10:50:14 +0200
Message-Id: <20200529085038.26008-11-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Unify the 'chunk_ids' and 'chunk_sizes' arrays into an array of
'struct chunk_info'.  This will allow more cleanups in the following
patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 416f4b2468..d95c739c10 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1344,14 +1344,18 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+};
+
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[6];
-	uint64_t chunk_sizes[6];
+	struct chunk_info chunks[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
@@ -1394,25 +1398,25 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
-	chunk_sizes[0] = GRAPH_FANOUT_SIZE;
-	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
-	chunk_sizes[1] = hashsz * ctx->commits.nr;
-	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	chunk_sizes[2] = (hashsz + 16) * ctx->commits.nr;
+	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
+	chunks[0].size = GRAPH_FANOUT_SIZE;
+	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
+	chunks[1].size = hashsz * ctx->commits.nr;
+	chunks[2].id = GRAPH_CHUNKID_DATA;
+	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	if (ctx->num_extra_edges) {
-		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
-		chunk_sizes[3] = 4 * ctx->num_extra_edges;
+		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
+		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
-		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
-		chunk_sizes[4] = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
+		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
 		num_chunks++;
 	}
 
-	chunk_ids[num_chunks] = 0;
-	chunk_sizes[num_chunks] = 0;
+	chunks[num_chunks].id = 0;
+	chunks[num_chunks].size = 0;
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
@@ -1425,12 +1429,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
 
-		chunk_write[0] = htonl(chunk_ids[i]);
+		chunk_write[0] = htonl(chunks[i].id);
 		chunk_write[1] = htonl(chunk_offset >> 32);
 		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
 		hashwrite(f, chunk_write, 12);
 
-		chunk_offset += chunk_sizes[i];
+		chunk_offset += chunks[i].size;
 	}
 
 	if (ctx->report_progress) {

From patchwork Fri May 29 08:50:15 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578129
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BD16C912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:20 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id A4B3720776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:20 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="BwlSmuDN"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726891AbgE2IvT (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:19 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44346 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726829AbgE2IvP (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:15 -0400
Received: from mail-wr1-x442.google.com (mail-wr1-x442.google.com
 [IPv6:2a00:1450:4864:20::442])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 20632C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:15 -0700 (PDT)
Received: by mail-wr1-x442.google.com with SMTP id c3so2488140wru.12
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:15 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=3AwbeNjCGkrC2eS1uZk0En1qi4xz1nodTJdEFIybPKA=;
        b=BwlSmuDNLImKRYU7La69Brv2yUw3T4+oTc+ZZdYEopSF83hUo7yV4Zs2gAnkqxNjph
         IWVqtoMf1PLkQwp+xdA/+kTBCqNWkJ8ZGMO0q8t423VNw/HC/NaTF7V+pumAHZ+GwauS
         aW9Uh0F0ujNrCOW+VFtw48sn/9WgSTOXG7gAbCxYXvTTge44wCmZ+YmgJJ2aYRr6gUBA
         Rt4cUNxY3E/gNYKKPnYfhaOdp0zZcEyufUQacwbUq+MIwwrqrlUUY6oa14WS9xIMNTrR
         rMUyUr/rd2w40tfdwnCpPCNKRWRJMWCXNTYtk/DWxK944HnVqa9Bx7VqaonKSwZZ5EVo
         AAgQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=3AwbeNjCGkrC2eS1uZk0En1qi4xz1nodTJdEFIybPKA=;
        b=skNBBVEQYwVStOgr5+cOC5dD7KTahzPuLHHgHWVHUjKq4tPPPgKJ59SULsn4NfZbDt
         KUumZ/7RX2bi2w2kJAtt3+ho39jac8DVmosruwXY1sgiWOOa5e/PDi6muzwIGpSmMbmq
         nkSrqrjIndqS+w/8ioFPomBEC3K3w+JshpyS7bWOmWNGhx5ACk30k5YV4qogMhi2fCS5
         Nb+1Mv3DCgX0Vz0gx2xoeDA8zDeNAlrtUH2A0S8rrToUVK0MZldPsHHvl3FHYID97K7k
         4DGdzkr9J9paIokoSFDtFx/c63gQJZtsnVCYVPPrlAeISnh49JPryaYxKg2P7UJnm3we
         YxvA==
X-Gm-Message-State: AOAM533m0kcXk+H25s5thEnrWQv4LcfnLLuWDS6ksE2G5RS+c05XxYvS
        yH8Ec/umP6OuWQE53hF77JL5gIbg
X-Google-Smtp-Source: 
 ABdhPJwptIIUucDuaePpfZNNYJtCWIC1BUG3fn9SFPgfCNVX9REW/XNlnjBGykZvA7oyT7R7fII9nQ==
X-Received: by 2002:a05:6000:12c4:: with SMTP id
 l4mr8187491wrx.128.1590742273432;
        Fri, 29 May 2020 01:51:13 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.11
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:12 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array
 dinamically
Date: Fri, 29 May 2020 10:50:15 +0200
Message-Id: <20200529085038.26008-12-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

During writing a new commit-graph file, the size of the 'struct
chunk_info' array is hardcoded to be the maximum number of possible
chunks plus one for the terminating label.  This array size must be
adjusted each time when a new chunk is added, which can be easily
forgotten.

So let's make the constant-sized 'struct chunk_info' array an
ALLOC_GROW()-able array instead.

There are a couple of cases where writing a new commit-graph file
returns with error; those spots were adjusted to free the dinamically
allocated 'struct chunk_info' array.

Arguably the several ALLOC_GROW() call sites look unusal, because we
usually call ALLOC_GROW() in loops, once in each iteration...
However, this way anyone who adds a new chunk will notice that they
have to cargo-cult the ALLOC_GROW() call as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 74 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 27 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d95c739c10..c8ba17e457 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1355,12 +1355,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[6];
+	struct chunk_info *chunks = NULL;
+	int chunks_nr = 0, chunks_alloc = 0;
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
+	int ret = 0;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1398,35 +1399,48 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDFANOUT;
+	chunks[chunks_nr].size = GRAPH_FANOUT_SIZE;
+	chunks_nr++;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDLOOKUP;
+	chunks[chunks_nr].size = hashsz * ctx->commits.nr;
+	chunks_nr++;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = GRAPH_CHUNKID_DATA;
+	chunks[chunks_nr].size = (hashsz + 16) * ctx->commits.nr;
+	chunks_nr++;
 	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		num_chunks++;
+		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+		chunks[chunks_nr].id = GRAPH_CHUNKID_EXTRAEDGES;
+		chunks[chunks_nr].size = 4 * ctx->num_extra_edges;
+		chunks_nr++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		num_chunks++;
+		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+		chunks[chunks_nr].id = GRAPH_CHUNKID_BASE;
+		chunks[chunks_nr].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks_nr++;
 	}
 
-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+	chunks[chunks_nr].id = 0;
+	chunks[chunks_nr].size = 0;
+	/*
+	 * Do not increase 'chunks_nr', so it still reflects the number of
+	 * actual chunks, without the Chunk Lookup table's terminating label.
+	 */
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, chunks_nr);
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
-	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	for (i = 0; i <= num_chunks; i++) {
+	chunk_offset = 8 + (chunks_nr + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
+	for (i = 0; i <= chunks_nr; i++) {
 		uint32_t chunk_write[3];
 
 		chunk_write[0] = htonl(chunks[i].id);
@@ -1441,11 +1455,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
 			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
+			       chunks_nr),
+			    chunks_nr);
 		ctx->progress = start_delayed_progress(
 			progress_title.buf,
-			num_chunks * ctx->commits.nr);
+			chunks_nr * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
 	write_graph_chunk_oids(f, hashsz, ctx);
@@ -1454,7 +1468,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
-		return -1;
+		ret = -1;
+		goto cleanup;
 	}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
@@ -1481,7 +1496,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 		if (!chainf) {
 			error(_("unable to open commit-graph chain file"));
-			return -1;
+			ret = -1;
+			goto cleanup;
 		}
 
 		if (ctx->base_graph_name) {
@@ -1493,7 +1509,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 				if (result) {
 					error(_("failed to rename base commit-graph file"));
-					return -1;
+					ret = -1;
+					goto cleanup;
 				}
 			}
 		} else {
@@ -1513,13 +1530,16 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 		if (result) {
 			error(_("failed to rename temporary commit-graph file"));
-			return -1;
+			ret = -1;
+			goto cleanup;
 		}
 	}
 
 	commit_lock_file(&lk);
 
-	return 0;
+cleanup:
+	free(chunks);
+	return ret;
 }
 
 static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)

From patchwork Fri May 29 08:50:16 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578131
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B0DB9739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:22 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 92E4520776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:22 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="BuNCFk1e"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726893AbgE2IvV (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:21 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44370 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726451AbgE2IvR (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:17 -0400
Received: from mail-wm1-x344.google.com (mail-wm1-x344.google.com
 [IPv6:2a00:1450:4864:20::344])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A8C16C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:16 -0700 (PDT)
Received: by mail-wm1-x344.google.com with SMTP id f5so2446108wmh.2
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=1TeN09yRt1/4eMt06pjBn+7kZntign7fMUBusi+VkRU=;
        b=BuNCFk1eD8ZwH4ggbvN9D/D8uXNYRfzdbroFBRhTbsJNo1mOj6+yWJ/zMo+kAf9qJ5
         v/56UK51asXIczuW0WtCZwYkrBU3+9AIJueXb8ZoL8KBuIcVsCTmogPEyYVAwA19QE5c
         0Sw6Jq8/6LyaRVwMDGRwyofApJIa+XquDbI+EQwvscjnf7Qw1sG0cr+sOBZoaRQG/hZr
         1geSrn5zbOPl71GKdTolr4WH4w8/kZqyAtO9N26ZDUKTlSFvr9DS7taUkxaN5hd/nyVT
         jVfRRSwutCucJ5nnB4qujPNprMge7LA4Go0BMBLdGhCagpuaKdGidVibqEMT3XHhNiM1
         yi4A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=1TeN09yRt1/4eMt06pjBn+7kZntign7fMUBusi+VkRU=;
        b=PBjqC3B1zNTOewagydT3NXHAyIl+ETGNPfgvuHmlgoKtpM+XrngbncCaLUc0JHPoZE
         ZQeKcN8W8DdGRfoi+mwLbtQGCq3BhGttjcqIgBylBOFP96A+fXLbziDbMR4C9Lu8Eb28
         oQBIdOJGJ6Q8NXgvZvTYUxrz1Ub3xr5v7a9ltApbxv7d89hAPksvIJZMTbGoRqAFCE01
         b3VzwW+72Jge51E4o+cFgOvzdVNmUYrqaQIHpcaVNzNTXyhYjigQH5Y0Acrl1fGDHZ4s
         69GYEzI6t6yvHDMjWYXVY7YmbgN1WDIJesctQmboWuw+h/xNQYPFFyRfbvHCMPzYn1qV
         f2sA==
X-Gm-Message-State: AOAM5332J/0qEOQn460ThWB8kbna1xt8s6dbKUZMMNN8sVKcB7H7q3P6
        rpvvqSedVMdGKcs9LhYuibszbuaH
X-Google-Smtp-Source: 
 ABdhPJzgjVLqpL1kwjHzrK09KQF1BjIQ+fmj8S8gs+j++ViqjqSDe4H2rrYqPrK28U+7lZ5+R3wUxw==
X-Received: by 2002:a1c:df57:: with SMTP id w84mr7985604wmg.52.1590742275085;
        Fri, 29 May 2020 01:51:15 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.13
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:14 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 12/34] commit-graph: unify the signatures of all
 write_graph_chunk_*() functions
Date: Fri, 29 May 2020 10:50:16 +0200
Message-Id: <20200529085038.26008-13-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing in
the following two patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c8ba17e457..33e70bdfea 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -793,7 +793,7 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 };
 
-static void write_graph_chunk_fanout(struct hashfile *f,
+static int write_graph_chunk_fanout(struct hashfile *f,
 				     struct write_commit_graph_context *ctx)
 {
 	int i, count = 0;
@@ -815,17 +815,19 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 
 		hashwrite_be32(f, count);
 	}
+	return 0;
 }
 
-static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
+static int write_graph_chunk_oids(struct hashfile *f,
 				   struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+		hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
 	}
+	return 0;
 }
 
 static const unsigned char *commit_to_sha1(size_t index, void *table)
@@ -834,7 +836,7 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 	return commits[index]->object.oid.hash;
 }
 
-static void write_graph_chunk_data(struct hashfile *f, int hash_len,
+static int write_graph_chunk_data(struct hashfile *f,
 				   struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
@@ -852,7 +854,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			die(_("unable to parse commit %s"),
 				oid_to_hex(&(*list)->object.oid));
 		tree = get_commit_tree_oid(*list);
-		hashwrite(f, tree->hash, hash_len);
+		hashwrite(f, tree->hash, the_hash_algo->rawsz);
 
 		parent = (*list)->parents;
 
@@ -932,9 +934,10 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 
 		list++;
 	}
+	return 0;
 }
 
-static void write_graph_chunk_extra_edges(struct hashfile *f,
+static int write_graph_chunk_extra_edges(struct hashfile *f,
 					  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
@@ -984,6 +987,7 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 
 		list++;
 	}
+	return 0;
 }
 
 static int oid_compare(const void *_a, const void *_b)
@@ -1462,8 +1466,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			chunks_nr * ctx->commits.nr);
 	}
 	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, hashsz, ctx);
-	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_oids(f, ctx);
+	write_graph_chunk_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->num_commit_graphs_after > 1 &&

From patchwork Fri May 29 08:50:17 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578133
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4DA7A159A
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:23 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 375FB207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:23 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="mCzW2HZ8"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726905AbgE2IvW (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:22 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44376 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726878AbgE2IvS (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:18 -0400
Received: from mail-wm1-x344.google.com (mail-wm1-x344.google.com
 [IPv6:2a00:1450:4864:20::344])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 22AB5C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:18 -0700 (PDT)
Received: by mail-wm1-x344.google.com with SMTP id k26so2435108wmi.4
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=pxj6OC6L7CyhCiORkdwWsxjGstpQzlhALBp4zedrXEQ=;
        b=mCzW2HZ8Mf+NX19aBxAl2RUm+U/0SczJC2b95J69rg9oMtX9PP28LJEmPwPpPISVWm
         2SbFQIlyrCNWT78Ju6U1LVziARY1pH9oAespH9E9hf6uO/1ahRvnCFWE3GwL3FgV+PkV
         WMGZhGjiRSaOoFbIHABSYJlYP2ddfmH7lUW8cAGM+H427R3U4jOP+ZnwaXhUF8FNECzo
         COAlu+XNi+cc2hhyE8Wk7/JlCK8WQWZniQICyn04PPI3LYjD8EgbaNN3j/4rcodYlgX9
         ph9TAvOL6qiUrddK4zN9qXiX23cAynSnQPc/Kt9CB94jJqmWrZNpCO/yNtxxaCSLiNI2
         IDcg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=pxj6OC6L7CyhCiORkdwWsxjGstpQzlhALBp4zedrXEQ=;
        b=okRAwhdeV1S5uhWk4a5/YNJC/niq4qNQV/QcP6Goiy6V+nf/Y4HqB0MGlinPByJQ6b
         gEKQLBmiKryam4Jaokg1Y9F4UmjyRwk55Nc3OCwlNDldu5MxvzDpl+aoqmmg2eT8icJG
         bXj/+rxiIDaVH1NzM7TQr+X0JYlJ/afQP2Fnf6EJb+aOjwGv+95cDcK3MMh9hQk2cscU
         6awb4/J0lGFh8cYwCVLqCQTOoyLsRdnFyDX8FYdQ+HkTeujPcv27BS76+hvyKmlZvkJ0
         nOoRanbmFlcDkh1M4ycWqjcfVGNPIaaVxeQ9De0xFTQNZyq+L6OXT/82ENrfq/LhqXCU
         YIng==
X-Gm-Message-State: AOAM533sSIO7P02OgLP3Kj4NMLsG5cWgOaZgsGlYjYUFLZJ+6xTemNPw
        XnoCl0E57DvCPRDtZrG2sI2UMLzw
X-Google-Smtp-Source: 
 ABdhPJywvTAoJG5A9H/OT4oeSZFkYgGmg4Svy6eH4dHv9jKPRXgZKOlorcoY4sO53a+rY5OBGjgVyQ==
X-Received: by 2002:a1c:e389:: with SMTP id a131mr7782104wmh.46.1590742276469;
        Fri, 29 May 2020 01:51:16 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.15
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:15 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3
Date: Fri, 29 May 2020 10:50:17 +0200
Message-Id: <20200529085038.26008-14-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and an other block of code calling the functions responsible
for writing individual chunks.  In case of optional chunks like Extra
Edge List an Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code.  This patch series is about to add
more optional chunks, so there would be even more repeated conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 33e70bdfea..fe26245fad 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1348,9 +1348,12 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      struct write_commit_graph_context *ctx);
 struct chunk_info {
 	uint32_t id;
 	uint64_t size;
+	chunk_write_fn write_fn;
 };
 
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
@@ -1406,31 +1409,37 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDFANOUT;
 	chunks[chunks_nr].size = GRAPH_FANOUT_SIZE;
+	chunks[chunks_nr].write_fn = write_graph_chunk_fanout;
 	chunks_nr++;
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = GRAPH_CHUNKID_OIDLOOKUP;
 	chunks[chunks_nr].size = hashsz * ctx->commits.nr;
+	chunks[chunks_nr].write_fn = write_graph_chunk_oids;
 	chunks_nr++;
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = GRAPH_CHUNKID_DATA;
 	chunks[chunks_nr].size = (hashsz + 16) * ctx->commits.nr;
+	chunks[chunks_nr].write_fn = write_graph_chunk_data;
 	chunks_nr++;
 	if (ctx->num_extra_edges) {
 		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 		chunks[chunks_nr].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[chunks_nr].size = 4 * ctx->num_extra_edges;
+		chunks[chunks_nr].write_fn = write_graph_chunk_extra_edges;
 		chunks_nr++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 		chunks[chunks_nr].id = GRAPH_CHUNKID_BASE;
 		chunks[chunks_nr].size = hashsz * (ctx->num_commit_graphs_after - 1);
+		chunks[chunks_nr].write_fn = write_graph_chunk_base;
 		chunks_nr++;
 	}
 
 	ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 	chunks[chunks_nr].id = 0;
 	chunks[chunks_nr].size = 0;
+	chunks[chunks_nr].write_fn = NULL;
 	/*
 	 * Do not increase 'chunks_nr', so it still reflects the number of
 	 * actual chunks, without the Chunk Lookup table's terminating label.
@@ -1465,16 +1474,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			chunks_nr * ctx->commits.nr);
 	}
-	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, ctx);
-	write_graph_chunk_data(f, ctx);
-	if (ctx->num_extra_edges)
-		write_graph_chunk_extra_edges(f, ctx);
-	if (ctx->num_commit_graphs_after > 1 &&
-	    write_graph_chunk_base(f, ctx)) {
-		ret = -1;
-		goto cleanup;
-	}
+	for (i = 0; i < chunks_nr; i++)
+		if (chunks[i].write_fn(f, ctx)) {
+			ret = -1;
+			goto cleanup;
+		}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 

From patchwork Fri May 29 08:50:18 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578135
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 51591739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:24 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 3B15320810
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:24 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="PcVX5aBT"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726916AbgE2IvX (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:23 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44382 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726892AbgE2IvT (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:19 -0400
Received: from mail-wr1-x442.google.com (mail-wr1-x442.google.com
 [IPv6:2a00:1450:4864:20::442])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 84672C08C5C8
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:19 -0700 (PDT)
Received: by mail-wr1-x442.google.com with SMTP id x13so2573183wrv.4
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:19 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=cvj8cOV7M1ndq2vXw5ET9iZ08Z+6qHGRnsITK6Jakzk=;
        b=PcVX5aBTN5UWWSbRP6yveaXNWvzfdKpwwy4Fs4LtcNckkpx0ScUeGoP6W9BFccy8H0
         qpHdBFg9ytpza1+oer7AUx6OwfudJ6HjOa5ORuG6kQNaPQ2quzo+gDos7GS6vbgbcnil
         IUppLLZaQT8N+ux4FzMDb/QcxSCwxaXyet2cFJaCzyv5P8VJRHM4HddZ8J8OFgyTWlc8
         y3JjvcenO52clwZZmwS4l1Gc/AOuMJG137RAUfcVV8xireB3m1n8RnD1EvYZF3i+Nk7z
         pBaS+BDymAhP0YhuI7c16FYy75IKL2bl6NQH08xx0dLtGj6z5wmuRaIJR+hkvfWukeyg
         Z42Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=cvj8cOV7M1ndq2vXw5ET9iZ08Z+6qHGRnsITK6Jakzk=;
        b=NsAByHZuCTdnOkdm6639mfjRQO1oUWiBy+NZfIKc1CX2/9RR9cDbRdpOYZwqa+F/Em
         35xR1zHP0xNqIQrH4HrTHsimEW8JiUwpcky2OTBX74zjx0lk1t2i1cyNiDUo2vE6FrAT
         xREEo7h9ZD2O0dtW6Sm5Mdgf8CblxI/OsBl3idMI9YxdSm4QWb7uKwd8GjAy+4xD3Hcp
         pZCoXEDTZkJBzdDmtt1QYSAit10qqj3KPNg9B56FDrRDEkPmd83U6tuROElr26q+f9Nf
         dLdy66quUATBiUMMOIEhY8gqE/DWs86/ugTEqSKA/PuUsWu3dNedckP3x9eFxq89wg1I
         2lyg==
X-Gm-Message-State: AOAM533izL1H4ij3+qE1FrZSFMzpgnEsDmtyVhiecB4ldkpnV3Up07in
        pjrP+LeKMmsRPmOhEF3lAdjCn1Nb
X-Google-Smtp-Source: 
 ABdhPJx8gt7tuqiSW8O4VvHfcIR0Nb+yggIi4cg1z0ylMh8Sl4zH2j175qnMRLni1iVnZykCdNjMxw==
X-Received: by 2002:adf:9447:: with SMTP id 65mr7773601wrq.331.1590742277996;
        Fri, 29 May 2020 01:51:17 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.16
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:17 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 14/34] commit-graph: check chunk sizes after writing
Date: Fri, 29 May 2020 10:50:18 +0200
Message-Id: <20200529085038.26008-15-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_commit_graph_*() function wrote the
amount of data that it was expected to, and what has been encoded in
the Chunk Lookup table.  Now that all commit-graph chunks are written
in a loop we can do this check in a single place for all chunks, and
any chunks added in the future will get checked as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index fe26245fad..e3adffd58b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1474,11 +1474,22 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			progress_title.buf,
 			chunks_nr * ctx->commits.nr);
 	}
-	for (i = 0; i < chunks_nr; i++)
+	chunk_offset = f->total + f->offset;
+	for (i = 0; i < chunks_nr; i++) {
+		uint64_t end_offset;
+
 		if (chunks[i].write_fn(f, ctx)) {
 			ret = -1;
 			goto cleanup;
 		}
+
+		end_offset = f->total + f->offset;
+		if (end_offset - chunk_offset != chunks[i].size)
+			BUG("expected to write %lu bytes to chunk %08x, but wrote %lu instead",
+			    chunks[i].size, chunks[i].id,
+			    end_offset - chunk_offset);
+		chunk_offset = end_offset;
+	}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 

From patchwork Fri May 29 08:50:19 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578141
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A7D4F912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:30 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 6D7DC208A7
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:30 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="gt+X7fiI"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726931AbgE2Iva (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:30 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44404 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726064AbgE2Iv0 (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:26 -0400
Received: from mail-wm1-x341.google.com (mail-wm1-x341.google.com
 [IPv6:2a00:1450:4864:20::341])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4F575C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:25 -0700 (PDT)
Received: by mail-wm1-x341.google.com with SMTP id v19so2339444wmj.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=SgCDjDXcRZ3DF8Ikw8mIO/UrMskL6fM/dl5eNBNLf/k=;
        b=gt+X7fiI6gnS+mC/IwtQ8ud0Y0INz9O3PV/7OWdT2zmxguQYq5vS5L3YWV2GUTku4s
         OUI51ErgHCrKcfXKveYUiSb9tRj5PnYZTt8VZTPpfgTah+E310SfF5pe8YtNMDx1dew3
         2GTkxcdz/H6tJqaAjwSJ0/vq3ZGH7oGspDlwHxt6oppkgylIueV1uoW5WjvBu5tXENf8
         vRJEr571TzMAYSxt+TIF0eur+QJ4PIicYTl7wWXuqqR7mMwOGoHQCsgQlzbSnbIPCTBZ
         Z2S7o+W0i5Is51j8mDWbq0WoiFSVN9xd/okjbDoPn4EC9M40hYdsyEo2GRsnw2UoB2GJ
         DIBw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=SgCDjDXcRZ3DF8Ikw8mIO/UrMskL6fM/dl5eNBNLf/k=;
        b=IqFEFWIWa4FPtI5MIuO7EJeKn1Vg1ZASQB7ocvgyO1LMGf2DMOcrmcZecped5DrXLz
         UREJ1ua0Kqm7FA7457X0InGixCGH3YBEjWBDb+aXOEiq0jbr1c/b4MJnwcde9Hlwd60B
         13ubdU/W8xhiCZaWM8xXhdeS487zr6FnWNDdUglzHTfIqnIvNOitmXU2gPIlaptjVMD/
         H18hgRNDRJsg/My03M4h5dtN/5QAp0MHQ/Xo2cWj4c0+I6FeBHujRlBp2EQjYwRacVZm
         DEvQQgwcGF/Qd0E2tkLS8sJJP6CPkz0CwwyK9dbfeZITXQRzsIGNcVIl7iHM28iBXwTU
         sb5g==
X-Gm-Message-State: AOAM530zz92+BpaZEL09kZPLFkhJoDmBpF6Vm0DMh2B1vu70J0mUfO0b
        a7LLrmCG5CyZvwPukffxPCZ6Wjao
X-Google-Smtp-Source: 
 ABdhPJy91kBn1gCpmWyn2BpmEesEp9aRD66fghambCgLMPjpxNwp9TqVmSRAvQNCyWZqa+ClX1BcpA==
X-Received: by 2002:a1c:bd86:: with SMTP id n128mr7912174wmf.5.1590742279679;
        Fri, 29 May 2020 01:51:19 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.18
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:18 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 15/34] commit-graph-format.txt: document the modified path
 Bloom filter chunks
Date: Fri, 29 May 2020 10:50:19 +0200
Message-Id: <20200529085038.26008-16-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

During a pathspec-limited revision walk, e.g. 'git log -- dir/file',
Git spends a significant part of its runtime in the tree-diff
machinery, checking whether the given path was modified between
subsequent commits.  By using Bloom filters to store the paths
modified by each commit we can quickly tell that a commit didn't
modify the given path without invoking tree-diff, thus considerably
reduce this diffing overhead, along with the runtime and memory usage.
This patch extends the commit-graph file format with additional chunks
to store those modified path Bloom filters.

The rest of this log message takes a closer look at the problem and
explains why these new chunks look like they do.

In the following the terms "file" and "path" are not used
interchangeably.  If a commit modified 'dir/subdir/foo.c', then it
modified that one file, but it modified three paths (namely 'dir',
'dir/subdir', and 'dir/subdir/foo.c').

Furthermore, unless otherwise noted, "a lot of paths" means 5000
randomly selected paths that have existed at one time or another
during the history of the corresponding repository's default branch
[1], except for git it means all 5089 paths that have ever existed
during the history of v2.25.0, and for homebrew-cask and homebrew-core
all 7307 and 6535 paths, respectively, that have ever existed in the
history of their default branches some time ago.

About pathspec-limited revision walks
-------------------------------------

So, during a pathspec-limited revision walk Git spends a significant
part of its runtime in the tree-diff machinery, checking whether the
given path was modified between subsequent commits.  The table below
shows the average runtime of 'git rev-list HEAD -- $path' for "a lot
of paths" in a couple of repositories, and how much of that runtime is
spent in diff_tree_oid().  It also shows the potential average
speedup, should we be able to reduce the tree-diff overhead to zero
without introducing some other overhead (spoiler alert: we won't be
able to achieve that, of course, but still will achieve even higher
average speedup in several cases).

                 Average    Average            Potential
                 runtime   diff time            speedup
  -------------------------------------------------------
  android-base   0.8780s    0.7320s    83.37%     6.14x
  cmssw          0.3143s    0.2800s    88.95%     9.19x
  cpython        0.7453s    0.6602s    88.58%     8.76x
  elasticsearch  0.1492s    0.1351s    90.55%    10.58x
  gcc            7.1852s    6.9432s    96.63%    29.69x
  gecko-dev      4.6113s    4.0964s    88.83%     8.96x
  git            0.6180s    0.5911s    95.65%    22.97x
  glibc          0.5618s    0.5313s    94.57%    18.42x
  go             0.4913s    0.4469s    90.96%    11.07x
  jdk            0.0482s    0.0431s    89.42%     9.45x
  linux          0.7043s    0.6163s    87.50%     8.00x
  llvm-project   2.6844s    2.2607s    84.22%     6.34x
  rails          0.2784s    0.2372s    85.20%     6.76x
  rust           0.7757s    0.7349s    94.74%    19.01x
  tensorflow     0.6258s    0.5735s    91.64%    11.96x
  webkit         1.9137s    1.6580s    86.64%     7.48x

Notice that the average time spent diffing in the Linux and Git
repositories is quite close (0.6163s vs 0.5911s), although the Linux
repository contains about 15x more commits and 25x more files; more on
this later.

Instrumenting the tree-diff machinery and gathering some stats about
the number of diff calls, tree object comparisons and decoded tree
entries revealed some interesting, perhaps even counterintuitive
things:

  - The number of scanned tree entries can be more important than the
    number of tree comparisons.

    Here are four paths from two repositories, two each in the same
    frequently-modified directory, with one near the beginning and one
    at the end of that directory.  When listing the commits modifying
    these paths, i.e. 'git rev-list HEAD -- $path', the number of tree
    comparisons is (nearly) the same, but the number of scanned tree
    entries differs by over two orders of magnitude.

                                      Average   Nr of scanned   Nr of tree
      Repository          $path      diff time   tree entries  comparisons
      --------------------------------------------------------------------
      git            .clang-format     0.188s          40115       18055
      git            zlib.c            0.729s       10705983       17120
      homebrew-core  Formula/a2ps.rb   8.390s        2122937      326758
      homebrew-core  Formula/zzz.rb   80.495s     1235547836      326758

    This is also noticeable when looking at the average number of
    scanned tree entries and tree comparisons when running 'git
    rev-list HEAD -- $path' for "a lot of paths":

                                Average nr      Average      Average
                     Average    of scanned     nr of tree   nr of diff
                    diff time  tree entries   comparisons      calls
      ----------------------------------------------------------------
      jdk            0.0431s       204466         5520         3702
      elasticsearch  0.1351s       720393        17212        13457
      rails          0.2372s      1299221        43369        33670
      cmssw          0.2800s      1938067        24798        23739
      go             0.4469s      3207155        76432        42346
      glibc          0.5313s      6555231        40442        29208
      tensorflow     0.5735s      3914851        97227        47262
      git            0.5911s      7480371        21789        18168
      linux          0.6163s      4004067        79514        58145
      cpython        0.6602s      4695754        88408        70439
      android-base   0.7320s      4312043       122280        96376
      rust           0.7349s      5084683        68110        29847
      webkit         1.6580s      9395349       303105       218906
      llvm-project   2.2607s     11773792       469801       330250
      gecko-dev      4.0964s     42396246       415843       387162
      gcc            6.9432s     84853083       286904       174218

    Note that although we have a lot less tree comparisons in the git
    than in the linux repository, it has almost double the amount of
    scanned tree entries, because the git repository has some
    frequently modified biggish directories (notably its root dir, 't'
    or 'Documentation').  As a result the average time spent diffing
    in the two repositories is roughly the same, although the linux
    repository is much larger.

    Or that we spend the most time diffing in the gcc repository,
    because it has by far the most scanned tree entries, even though
    it has less tree comparisons than any of the next three slowest
    repositories.

  - The number of path components in the pathspec, i.e. its depth in
    the directory tree seems to be irrelevant.

    When checking whether a path somewhere deep in the directory tree
    has been modified between a commit and its parent the tree-diff
    machinery can short-circuit, and it returns as soon as it finds
    the first leading directory that hasn't been modified.  And more
    often than not it can short-circuit already while comparing the
    root trees, as all the <1.5 values in the "Average tree
    comparisons per diff call" column show.

                     Average   Average tree     Average      Average
                    pathspec    comparisons    nr of diff   nr of tree
                      depth    per diff call      calls    comparisons
      ----------------------------------------------------------------
      android-base    5.78         1.26          96376       122280
      cmssw           4.28         1.04          23739        24798
      cpython         3.72         1.25          70439        88408
      elasticsearch   9.13         1.28          13457        17212
      gcc             5.13         1.64         174218       286904
      gecko-dev       6.25         1.07         387162       415843
      git             2.30         1.19          18168        21789
      glibc           4.17         1.38          29208        40442
      go              4.60         1.80          42346        76432
      jdk             8.45         1.49           3702         5520
      linux           4.49         1.36          58145        79514
      llvm-project    5.45         1.42         330250       469801
      rails           5.43         1.28          33670        43369
      rust            4.70         2.28          29847        68110
      tensorflow      5.56         2.06          47262        97227
      webkit          6.33         1.38         218906       303105

    Note the 2.xy average tree comparisons per diff call in the rust
    and tensorflow repositories.  In the rust repository over 98% of
    the paths are in the 'src' directory and over 93% of commits
    modify a file under that directory, while in the tensorflow
    repository over 92% of paths are in the 'tensorflow' directory and
    over 90% of commits modify a file under that directory.
    Consequently, the tree-diff machinery can only rarely
    short-circuit while comparing the root trees of two subsequent
    commits, but has to dive down and compare the contents of the
    'src' or 'tensorflow' directories most of the time as well.  I
    suspect that we would get a similar ~2 value in the homebrew-core
    repository as well, because over 99.7% of all commits modify the
    'Formula' directory, which contains over 90% of paths in the
    repository.

Bloom filters intro
-------------------

Quoting the (quite good) Wikipedia article, "A Bloom filter is a
space-efficient probabilistic data structure [...] that is used to
test whether an element is a member of a set. False positive matches
are possible, but false negatives are not – in other words, a query
returns either 'possibly in set' or 'definitely not in set'".

A Bloom filter is a bit array, initially all bits set to 0.  To add an
element to a Bloom filter, the element is hashed using 'k' independent
hash functions, the resulting hashes are turned into 'k' array
positions, and the bits at those positions in the filter are set to 1.
To query for an element, the element is hashed using the same 'k' hash
functions, the resulting hashes are turned into 'k' array positions,
and the values of the bits at those positions are checked.  If all
those bits are set, then the element is 'possibly in set'.  If even
one of those bits is unset, then the element is 'definitely not in
set'.

Some of the for us relevant properties of Bloom filters are:

  - A Bloom filter doesn't store the elements themselves, it only sets
    a couple of bits based on the elements' hashes.  Consequently, a
    Bloom filter can't tell which elements it contains; it can't even
    tell the number of elements in there.

  - A bit in the Bloom filter's bit array might be set by more than
    one element.  This is where the probabilistic nature and the
    possibility of false positives come from: it is possible that some
    elements in the set happen to have set all 'k' bits that would
    indicate the presence of a particular element, even though that
    element itself is not part of the set.

  - Elements can't be removed from a Bloom filter: the 'k' bits
    indicating the presence of an element can't simply be unset,
    because that might unset bits that were set by other elements as
    well.
    There are enhanced Bloom filter variants that allow removal of
    elements, like a counting Bloom filter, but they come with
    additional complexity and require considerably more space.

  - Bloom filters can't be resized, because turning the hashes into
    positions in the bit array critically depends on the size of the
    bit array (usually pos[i] = hash[i] % bit_array_size).

  - Bloom filters of the same size can be bitwise OR-ed to create the
    union of the sets of elements in the filters.

  - When playing long enough with probabilities and making some
    reasonable assumptions and approximations it can be shown (see the
    Wikipedia article) that for a desired false positive probability
    'p' there is an optimal number of 'k' hash functions:

      k = -log₂(p)    (that's a base 2 logarithm there)

    For 1% false positive probability k = 6.64, but of course there
    can only be an integral number of hash functions, so 7 it is.

    Note that this value is independent both from the size of the
    Bloom filter and from the number of elements in there.

    Inversely, when using a properly sized bit array, then the false
    positive probability falls exponentially as 'k' increases.

  - To store 'n' elements in a Bloom filter with a desired false
    positive probability using the optimal number of 'k' bits/hash
    functions, we need a bit array with approximate size 'm':

      m ≈ n * k / ln(2) ≈ n * k * 10 / 7

    When using 7 hash functions to aim for < 1% false positive
    probability this simplifies down to:

      m ≈ n * 7 * 10 / 7 = 10 * n

    i.e. approx. 10 bits per element.

  - In general the more elements, IOW the more set bits there are in a
    Bloom filter of certain size, the higher the probability of false
    positives.  Similarly, the larger the size of a Bloom filter's bit
    array for a certain number of elements, the lower the probability
    of false positives.

  - A Bloom filter with all bits set appears to contain "everything",
    because all queries will return 'possibly in set'.

Modified Path Bloom Filters
---------------------------

We'll use Bloom filters to store modified paths and will store them in
the Modified Path Bloom Filters chunk.  Yes, plural: we'll use a lot
of small Bloom filters, each one storing all paths modified by one
commit compared to one of its parents (see the Alternatives section
near the end for reasons for not using one big Bloom filter).  Then as
the pathspec-limited revision walk iterates over all commits it will
query each modified path Bloom filter to see whether the given
pathspec is in there, and if it is 'definitely not in set', then we
can spare a tree-diff call and skip the commit right away, but if it's
'possibly in set', then we must do the tree-diff to make sure.

Each modified path Bloom filter consists of:

  - 4 bytes specifying the number of bits in the Bloom filter's bit
    array.

    For practical purposes 32 bit is more than sufficient to store the
    number of bits in the Bloom filter's array.  When using k = 7
    hashes, i.e. 10 bits per path, then we could store over 400
    million paths in a single Bloom filter; with k = 32 hashes we'd
    use 46 bit per path, which is still over 93 million paths.

  - The actual bit array.  See the next and the Hashing scheme
    sections for details about how this bit array is used.

All modified path Bloom filters will use the same k number of hashes
per path, so we won't have to store that in each Bloom filter.  I
suspect that having modified path Bloom filters with different k
number of hashes wouldn't bring enough benefit to justify storing that
value in each and every filter.

The order of modified path Bloom filters in this chunk is unspecified
on purpose, so implementations can experiment with writing them in
history or topological order, which may bring performance benefits
through better locality.

[Somewhere in the middle of this section the length of this commit
message surpassed the length of 72441af7c4 (tree-diff: rework
diff_tree() to generate diffs for multiparent cases as well,
2014-04-07), the mother of all commit messages from the great Kirill
Smelkov, yay! :)]

Modified Path Bloom Filter Index
--------------------------------

Since modified path Bloom filters vary in size, we can't have an array
of modified path Bloom filters indexed by the position of the commit
OID in the OID Lookup chunk, but we'll need a level of indirection
between the commit OID position and the position of the associated
modified path Bloom filter.

So the Modified Path Bloom Filter Index chunk will contain an array of
offsets to quickly map commit OID positions to offsets in the Modified
Path Bloom Filters chunk, i.e. its Nth entry contains the offset for
the commit whose OID is Nth in the OID Lookup chunk.

Since the commit-graph file can store information about 2^31-1
commits, a merely 4 byte offset per commit is clearly insufficient.
However, surely noone want to have lots of gigabyte-sized modified
path Bloom filters, so a standard 8 byte file offset will be
underutilized.  This allows us to use a few bits from those 8 bytes
for special purposes.  For now we'll have two special purpose bits:

  - The most significant bit indicates that the entry is not an offset
    into the Modified Path Bloom Filters chunk, but an "embedded"
    modified path Bloom filter containing all paths modified by the
    commit compared to its first parent; see the next section.

  - The second most significant bit indicates that modified path Bloom
    filters are stored not only for the first parent of a merge
    commit, but for all its parents, and the entry is neither an
    offset nor an "embedded" modified path Bloom filter, but an array
    index into the Modified Path Bloom Filter Merge Index chunk; see
    the "Merges" section below.

  - [TODO: perhaps we might want to reserve one or two more bits for
    special purposes?]

Make these offsets relative to the start of the Modified Path Bloom
Filters chunk, so they will depend neither on the number of chunks nor
on the combined size of any other chunks that are written before the
Modified Path Bloom Filters chunk, thus allowing implementations to
calculate all these offsets without knowing anything about those other
chunks.  Furthermore, if the offsets were relative to the beginning of
the file, then some huge chunks could make the offsets grow too large,
and would mess up those special purpose bits (though, arguably, an
exabyte sized commit-graph file is just too unlikely to be worried
about...).

An "all bits set" index entry can be used to indicate that there is no
modified path Bloom filter stored for the corresponding commit.  A
reader implementation can either special-case such an entry, or
interpret it as an embedded modified path Bloom filter that replies
'possibly in set' to all queries, the result is the same: it has to
resort to running tree-diff for that commit.

These embedded modified path Bloom filters have implications on the
low-level Bloom filter format in the Modified Path Bloom Filters chunk
as well, namely:

  - Their array contains only 63 bits, not 64, i.e. not 8 full bytes.
    Therefore, for simplicity, we'll store the size of the bit array
    as the number of bits, not bytes.

  - Assigning special purpose to the most significant bits of the
    index entries is convenient when the entry is an offset into the
    Modified Path Bloom Filters chunk.  OTOH, it makes indexing into
    the Bloom filter's array awkward if we try to treat it like an
    ordinary array, i.e. whose 0th element comes first, because we'd
    have to account for the special purpose bits.  Therefore, we'll
    store the bit array kind of in big endian byte order, i.e. the
    most significant byte first and the 0th byte last.

In addition, this chunk will start with a small header storing a bit
of metadata that applies to all modified path Bloom filters in all
related chunks.  The reason for storing this header in the Modified
Path Bloom Filter Index chunk is that only this chunk is essential to
use modified path Bloom filters (if all commits modify so few files
that their modified path Bloom filters can all be embedded into the
index chunk, then the Modified Path Bloom Filters chunk will contain
nothing and thus can be omitted; e.g. the two homebrew repositories
come quite close to this, as shown below).

This header contains:

  - One byte storing the number of 'k' hashes per path.

    Using a single byte is more than sufficient: as 'k' increases the
    false positive probability falls exponentially, and quickly
    becomes unnecessarily low (i.e. with k = 31 even a full
    commit-graph containing 2^31-1 commits is expected to have only a
    single false positive).

  - [TODO: What else?  We might want to have a version byte, though if
    a format change becomes necessary, then we could simply rename the
    chunks just as well...]

Embedded Modified Path Bloom Filters
------------------------------------

The ideal size of a Bloom filter to store a set of 6 or 7 elements
with 7 hash functions is approximately 60 or 70 bits, respectively.
So in the space of a 8 byte Modified Path Bloom Filter Index entry we
could comfortably store 6 entries by using one bit to indicate that
the entry is not a file offset but an "embedded" modified path Bloom
filter and the remaining 63 bits as the Bloom filter's bit array.

As the table below shows, a significant portion or even the vast
majority of commits modify no more than 6 paths, so we can embed a lot
of modified path Bloom filters into the Modified Path Bloom Filter
Index chunk.  (Also note the unusually large percentage of empty diffs
in the android-base and cpython repositories.)

        Percentage of commits modifying <=N (or =N) paths compared to their first parents
                   0      <=1      <=2      <=3      <=4      <=5      <=6      <=7      <=8
                          (=1)     (=2)     (=3)     (=4)     (=5)     (=6)     (=7)     (=8)
  --------------------------------------------------------------------------------------------
  elasticsearch  0.70%   4.00%    5.67%    8.50%   13.17%   16.55%   18.32%   21.18%   27.47%
                        (3.30%)  (1.67%)  (2.83%)  (4.67%)  (3.37%)  (1.77%)  (2.86%)  (6.29%)
  jdk            0.26%   3.47%   10.60%   13.99%   15.62%   19.02%   26.62%   34.30%   40.57%
                        (3.20%)  (7.14%)  (3.39%)  (1.63%)  (3.40%)  (7.60%)  (7.68%)  (6.27%)
  webkit         0.05%   0.07%    0.77%    2.14%    9.15%   26.66%   38.42%   47.34%   52.83%
                        (0.02%)  (0.71%)  (1.37%)  (7.01%) (17.51%) (11.76%)  (8.92%)  (5.49%)
  android-base  13.20%  13.62%   14.23%   18.55%   20.91%   35.18%   42.32%   50.82%   62.05%
                        (0.42%)  (0.62%)  (4.32%)  (2.36%) (14.28%)  (7.14%)  (8.49%) (11.23%)
  llvm-project   0.12%   0.12%    0.94%    6.45%   25.24%   46.68%   53.60%   60.97%   67.33%
                        (0.00%)  (0.81%)  (5.51%) (18.79%) (21.44%)  (6.92%)  (7.37%)  (6.36%)
  gecko-dev      0.14%   0.96%    1.88%   15.44%   32.42%   46.12%   54.54%   61.37%   66.65%
                        (0.82%)  (0.92%) (13.56%) (16.98%) (13.70%)  (8.42%)  (6.82%)  (5.28%)
  tensorflow     0.09%   1.26%    2.72%    5.00%   26.30%   42.36%   55.17%   63.11%   69.70%
                        (1.17%)  (1.46%)  (2.28%) (21.27%) (16.07%) (12.81%)  (7.94%)  (6.59%)
  rails          0.10%   2.09%    5.79%   16.03%   35.57%   51.47%   58.71%   65.15%   72.96%
                        (1.99%)  (3.70%) (10.23%) (19.54%) (15.90%)  (7.24%)  (6.44%)  (7.82%)
  rust           0.07%   2.20%    5.11%   22.81%   42.35%   52.50%   59.29%   65.33%   70.02%
                        (2.13%)  (2.91%) (17.70%) (19.54%) (10.15%)  (6.79%)  (6.04%)  (4.69%)
  glibc          0.02%   7.33%   14.03%   30.86%   42.22%   52.53%   61.59%   68.50%   73.68%
                        (7.31%)  (6.70%) (16.83%) (11.36%) (10.32%)  (9.06%)  (6.91%)  (5.19%)
  gcc            0.00%   0.24%   10.92%   26.61%   39.85%   54.97%   63.80%   69.37%   76.52%
                        (0.24%) (10.68%) (15.69%) (13.24%) (15.13%)  (8.82%)  (5.57%)  (7.14%)
  go             0.00%   0.96%    9.09%   19.35%   39.97%   53.16%   65.31%   72.10%   77.36%
                        (0.95%)  (8.13%) (10.26%) (20.63%) (13.19%) (12.15%)  (6.79%)  (5.26%)
  cmssw          0.15%   0.19%    0.20%    2.43%   45.35%   56.49%   67.58%   73.17%   77.99%
                        (0.03%)  (0.01%)  (2.23%) (42.92%) (11.15%) (11.08%)  (5.59%)  (4.83%)
  linux          0.01%   0.66%    3.97%   23.49%   46.15%   62.97%   72.79%   79.16%   83.57%
                        (0.65%)  (3.30%) (19.52%) (22.66%) (16.82%)  (9.82%)  (6.37%)  (4.41%)
  cpython        3.07%   4.97%   27.73%   59.54%   70.34%   77.48%   81.91%   86.76%   89.34%
                        (1.91%) (22.75%) (31.82%) (10.80%)  (7.13%)  (4.43%)  (4.85%)  (2.59%)
  git            0.11%  27.54%   55.92%   73.79%   81.90%   87.05%   90.28%   92.29%   93.82%
                       (27.43%) (28.38%) (17.87%)  (8.11%)  (5.15%)  (3.23%)  (2.01%)  (1.53%)
  homebrew-cask  0.40%   0.94%   95.41%   97.42%   98.11%   98.40%   98.61%   98.79%   98.93%
                        (0.54%) (94.46%)  (2.01%)  (0.70%)  (0.29%)  (0.21%)  (0.18%)  (0.14%)
  homebrew-core  0.01%   0.07%   98.81%   99.35%   99.56%   99.75%   99.81%   99.84%   99.86%
                        (0.07%) (98.74%)  (0.53%)  (0.22%)  (0.19%)  (0.06%)  (0.03%)  (0.02%)

This saves space, because the Modified Path Bloom Filters chunk will
contain a lot less Bloom filters (albeit those would be rather small
filters).

It reduces the probability of false positives, because all commits
modifying 1-6 paths will have larger than strictly necessary modified
path Bloom filters.

Finally, it makes the Bloom filter query ever so slightly faster,
partly because there is no redirection into the Modified Path Bloom
Filters chunk, and partly because we can check all bit positions in a
63 bit modified path Bloom filter using a 64 bit mask at once, instead
of checking those bit positions one by one, and we can use the same
mask to check all embedded Bloom filters.

The number of paths that can be stored in a 63 bit Bloom filter
depending on the number of hashes per path:

  k     |  3   |  4   |  5  |  6  |  7  |  8  | 9 - 11 | 12 - 14 | 15 - 22 | 23 - 44
  ------+------+------+-----+-----+-----+-----+--------+---------+---------+--------
  paths | <=14 | <=11 | <=8 | <=7 | <=6 | <=5 |  <=4   |   <=3   |   <=2   |   <=1

Hashing scheme
--------------

We need to map each modified path to 'k' independent bit positions in
the Bloom filters bit array.  We want the hash function and hashing
scheme that results in the lowest false positive rate and has the
lowest probability of "colliding" paths (see below), but it should
still be fast to compute, and should be widely available for
alternative Git implementations.

At this point we don't care about the runtime of pathspec-limited
revision walks here.  Lower false positive rate inherently leads to
lower runtime, though we are reaching diminishing returns on common
setups as the false positive rate gets lower and lower... and e.g. in
the webkit repository we'll reach an average false positive rate of
~0.001% with only k = 7 hashes per path.  However, on unusual
setups/configurations accessing tree objects might be considerably
more expensive than accessing commit objects, especially when using
only commit info stored in the commit-graph.  E.g. consider a future
where we can distribute commit-graphs with modified path Bloom filters
to partial clones containing only commit objects for most of the
history: any tree-diff will be really expensive, because the tree
objects must be fetched from the promisor.  In such a setup every
avoidable tree-diff call counts, and low false positive rate is
king.

For now I went with 32 bit MurmurHash3 used in enhanced double hashing
scheme with 32 bit unsigned integer arithmetic, though as I will show
below it seems that this is not the best option.

So, to map each modified path to 'k' bit positions in the Bloom
filter's array we first need 'k' independent hashes.  In general,
hashing a path 'k' times with the same hash function but using 'k'
different seeds produces hashes that are independent enough.  In
practice, to reduce the overhead of hashing, especially for larger 'k'
values, some variant of double hashing is often used to generate the
'k' independent-ish hashes from the results of only two hash function
calls with different seeds.  In general:

  h1 = h(seed1, path)
  h2 = h(seed2, path)
  for (i = 0; i < k; i++)
      pos[i] = (h1 + i * h2 + f(i)) % nr_bits

Depending on how the f(i) term is defined there are a few named
variants:

  - Double hashing: f(1) = 0:

      pos[i] = (h1 + i * h2) % nr_bits

  - Improved double hashing: f(i) adds a simple quadratic term:

      pos[i] = (h1 + i * h2 + i^2) % nr_bits

  - Enhanced double hashing: f(i) adds a not-so-simple cubic term:

      pos[i] = (h1 + i * h2 + (i^3 - i) / 6) % nr_bits

    This cubic term is equal to the following sequence of numbers,
    starting from i = 0:

      0, 0, 1, 4, 10, 20, 35, 56, 84, 120, etc.

    These are almost the same as tetrahedral numbers, except that the
    mathematical definition starts with 1, 4, 10..., while the OEIS
    A000292 sequence starts with 0, 1, 4, 10...

    I'm puzzled by this term being 0 not only for i = 0, but for i = 1
    as well, because this means that if (h2 % nr_bits == 0) (see
    below), then both pos[0] = h1 and pos[1] = h1, IOW in that case
    the path has one less bit set.  We'll take a look at whether
    starting the sequence with a single 0 makes a difference (it does)
    and how that affects the false positive rate (slightly increases
    it overall), and this will be labeled "enhanced double hashing
    variant #1" below.

    This sequence is supposed to be just as good as a simple i^3 term,
    but this can easily be calculated incrementally without
    multiplication, using only addition.  We'll take a look at how a
    simple cubic term affects the false positive rate (it increases
    it), and this will be labeled "enhanced double hashing variant
    #2" below.

These hashing schemes usually work fairly well in practice, but in
practice Bloom filters are primarily used to check whether an element
is part of _one_ _big_ set, and, consequently, most of the wisdom and
experience out there applies to big Bloom filters.

We, however, will repeatedly check whether a particular element (i.e.
path) is part of a large number of mostly very small sets, and our use
case does challenge those best practices.

In our use case it is important when two paths "collide", i.e. when
one path in itself sets all the bit positions that (falsely) indicate
the presence of another path in a modified path Bloom filter, so let's
have a look at that.  And let's hope against hope that I don't mess up
the math...

So, if we have k truly independent hashes for each path, then:

  (1) The probability of two paths mapping to the same k separate bit
      positions in a Bloom filter of size nr_bits is nr_bits! /
      ((nr_bits - k)! * k!).

  (2) The probability of a path setting only a single bit position is
      1 / nr_bits^(k-1), and the probability of a path setting one
      particular bit position is k / nr_bits (assuming that it does
      set k separate bits), so the probability of a path setting a bit
      position that happens to be the only bit position set by an
      other path is k / nr_bits^k.

In case of our 63 bit embedded modified path Bloom filters and k = 7
the probabilities of these two cases are about 1.81 * 10^(-9) and
1.77 * 10^(-12), respectively.  I think that these are low enough not
to worry about.

However, the hashes in the various double hashing schemes are far from
being independent.  When starting out with unsigned 32 bit hashes
(like we'll do) and using 64 bit arithmetic, then there is no integer
overflow, because neither i nor f(i) are large enough for that.  Then,
due to the distributivity of the modulo operation (and because i <
nr_bits), all double hashing schemes are equivalent to:

  pos[i] = (h1 % nr_bits + i * (h2 % nr_bits) + f(i) % nr_bits) % nr_bits

For the above mentioned two colliding cases this means:

  (1) if (h(seed1, "foo") % nr_bits == h(seed1, "bar") % nr_bits &&
          h(seed2, "foo") % nr_bits == h(seed2, "bar") % nr_bits)

      then all double hashing variants will work as if both paths were
      hashed to the same h1 and h2 values, and, consequently, both
      paths will be mapped to the same bit positions.  This has the
      probability of 1 / nr_bits^2.

  (2) if (h2 % nr_bits == 0)

      then the i * h2 term will basically amount to nothing and this
      term can be simplified away, and all bit positions will depend
      only on h1; this has the probability of 1 / nr_bits.

      In case of double hashing the f(i) term is 0 as well, meaning
      that the path maps to a single bit position.  The probability of
      a path setting a bit position that happens to be the only bit
      position set by an other path is k / nr_bits^2.

      The quadratic or cubic f(i) terms in improved or enhanced double
      hashing ensure that a path is mapped to multiple bit positions
      even in this case, though those bit positions are not nearly as
      random as one would like.

In case of 63 bit Bloom filters and k = 7, the probabilities of these
two cases are 1 / 3969 and 1 / 567, respectively.

When using 32 bit unsigned integer arithmetic, then an integer
overflow is definitely possible, so the double hashing formula
becomes:

  pos[i] = (h1 + i * h2 + f(i)) % 2^32 % nr_bits

If nr_bits is a power of two, then the "% 2^32" term can be simplified
away, and we end up with the same formula as with 64 bit arithmetic,
and the probabilities of cases (1) and (2) above remain the same.

If nr_bits is not a power of two, then...  well, I don't offhand know
how to approach that formally :)  Anyway:

  (1) This type of collision seems to occur if that two-liner
      condition above is true and for both paths there is an overflow
      for the same values of i, which has the approximate probability
      of 1 / ((k - 1) * nr_bits^2).

  (2) This type of collision seems to occur if (h2 % nr_bits == 0) and
      there is either no integer overflow for any values of i or if
      there is an overflow for all values of i, which has the
      approximate probability of k / ((k - 1) * nr_bits^2).

In case of 63 bit Bloom filters and k = 7 the probabilities of these
two cases are 1 / 23814 and 1 / 3402, respectively.

There are several other colliding cases, e.g. with double hashing
variants its more probable that a path maps only to 2, 3, etc. bit
positions instead of 'k' than with 'k' truly independent hashes,
though I haven't looked into how the probabilities work out.

Anyway, all the above already shows that:

  - These colliding cases are several orders of magnitude more likely
    with any double hashing variant than with k truly independent hash
    functions.

  - When using some sort of double hashing, then these colliding cases
    can happen with a high enough probability that we can't just
    ignore them.

  - These colliding cases can happen much more frequently with double
    hashing than with improved or enhanced double hashing.

  - Using 64 vs 32 bit arithmetic while calculating various double
    hashing schemes makes a difference, and suggests that 32 bit
    arithmetic has lower false positives probability.

  - Bloom filters whose size is a power of two might have higher false
    positive probability.

All this is important, because there are repositories out there that
modify the same path in the majority of commits, e.g.:

  - homebrew-core: contains 6535 paths, and over 99.5% of commits
    modify the 'Formula' directory and have an embedded modified path
    Bloom filter.

  - homebrew-cask: contains 7307 paths, and over 95.5% of commits
    modify the 'Casks' directory and have an embedded modified path
    Bloom filter.

  - rust: contains 58k+ paths, and almost 93% of commits modify the
    'src' directory, though only over 54% of commits modify that
    directory and have an embedded modified path Bloom filter.

  - tensorflow: contains 47k+ paths, and almost 91% of commits modify
    the 'tensorflow' directory, though only about 51% of commits
    modify that directory and have an embedded modified path Bloom
    filter.

  - go: contains 22k+ paths, and almost 84% of commits modify the
    'src' directory, though only over 50% of commits modify that
    directory and have an embedded modified path Bloom filter.

So e.g. if we were to look for commits modifying a path in the
homebrew-core repository, which happens to map to the same bit
positions in a 63 bit Bloom filter as the 'Formula' directory, then
Boom! we would get over 99.5% false positive rate, effectively
rendering modified path Bloom filters useless for that particular
path.

The table below shows the number of paths that happen to collide with
the repository's frequently modified directory in embedded modified
path Bloom filters using different hash functions and hashing schemes
with 32 bit unsigned integer arithmetic and 64 bit arithmetic (the
latter in parentheses):

                |    Double hashing   |      Enhanced     |      7 seeds
                |                     |   double hashing  |
                |  Murmur3 |  xxHash  | Murmur3 |  xxHash | Murmur3 | xxHash
  --------------+----------+----------+---------+---------+---------+--------
  homebrew-core |  4  (15) |  5  (17) |  0  (0) |  0  (2) |    0    |    0
  homebrew-cask |  2  (21) |  6  (14) |  0  (2) |  1  (1) |    0    |    0
  rust          | 18 (143) | 38 (144) |  1 (15) |  6 (24) |    0    |    0
  tensorflow    | 20 (110) | 18 (105) |  4 (15) |  0 (12) |    0    |    0
  go            |  9  (66) | 12  (62) |  0  (1) |  3  (8) |    0    |    0
  --------------+----------+----------+---------+---------+---------+--------
  all           | 53 (355) | 79 (342) |  5 (33) | 10 (47) |    0         0

The effect of embedded modified path Bloom filters on these colliding
cases can be both beneficial and harmful:

  - We use a larger than necessary Bloom filter to hold 1-6 paths,
    which lowers the probability of these cases considerably.

    This is especially important for the homebrew repositories.  E.g.
    in homebrew-core over 98.6% of commits modify a single file in the
    'Formula' directory, i.e. two paths in total.  To store 2 paths
    using 7 hashes per path we would need a Bloom filter with a 20 bit
    array, which we would round up to 24 bits to use full bytes, i.e.
    those 98.6% of commits would have merely 24 bit Bloom filters.
    This makes those colliding cases all the more probable: 1 / 3456
    for case (1) and 1 / 493.7 for case (2) with 32 bit arithmetic).

  - We use Bloom filters of the same size to hold 1-6 paths, so if a
    path were to run into these cases, then more Bloom filter queries
    would return false positives than when using appropriately sized
    Bloom filters.

Now, there is a simple trick that can, to some extent, alleviate these
collision issues: check all leading directories of the pathspecs, i.e.
while looking for commits modifying 'dir/subdir/file', then query the
modified path Bloom filters not only for the full path, but for all
its leading directories 'dir/subdir' and 'dir' as well.  This way we
would only get a false positive if all bit positions of all leading
directories were set as well, which can significantly reduce the
probability of a pathspec running afoul of these colliding cases, and,
in general, can reduce the false positive rate by an order of
magnitude or three as well (see later in this patch series).

Checking all leading directories is not a silver bullet, though,
because it can only help if the pathspec does actually have leading
directories, and the deeper the pathspec in the directory tree, i.e.
the more leading directories it has, the lower the false positive
rate becomes.

  - So this doesn't help pathspecs in the root tree, because they
    obviously don't have any leading directories.

  - Furthermore, this doesn't help pathspecs that are immediately
    below such a frequently modified directory, because their only
    leading directory is modified in the majority of commits.

    This means that it can't help in the homebrew repositories,
    because 85% or 90% of their paths are directly under their
    frequently modified directories.

The only thing that can help even in these cases is hashing the paths
k times using k different seeds.

Phew.  Let's see some actual benchmarks, shall we?

The following tables in this section show the average false positive
rate of various hash functions and hashing schemes while listing the
histories of "a lot of paths", i.e. 'git rev-list HEAD -- $path'.  A
'*' denotes the lowest value in each row.  All cases use k = 7 hashes
per path, use the same basically random seeds of immense historical
significance, store 6 path in embedded modified path Bloom filters,
and check all leading directories of the given pathspec.

The table below compares the average false positive rate of 64 bit and
32 bit unsigned integer arithmetic using double hashing and enhanced
double hashing with the MurmurHash3 hash function:

                    Double hashing     | Enhanced double hashing
                   32 bit     64 bit   |    32 bit     64 bit
  -------------------------------------+------------------------
  android-base   0.008539%  0.015709%  | *0.004155%  0.004961%
  cmssw          0.006022%  0.013334%  | *0.003953%  0.004109%
  cpython        0.036840%  0.079816%  | *0.016607%  0.019330%
  elasticsearch  0.004069%  0.005297%  |  0.003249% *0.003101%
  gcc            0.016982%  0.034096%  | *0.008919%  0.011152%
  gecko-dev      0.001058%  0.002691%  | *0.000725%  0.000829%
  git            0.144256%  0.331346%  | *0.069921%  0.079405%
  glibc          0.026480%  0.053838%  | *0.016389%  0.017902%
  go             0.021930%  0.050348%  | *0.012616%  0.014178%
  homebrew-cask  0.097523%  0.508175%  | *0.009096%  0.042034%
  homebrew-core  0.120860%  0.556014%  | *0.005360%  0.026810%
  jdk            0.006085%  0.007526%  |  0.006431% *0.005911%
  linux          0.010908%  0.019081%  | *0.007494%  0.007896%
  llvm-project   0.006417%  0.009327%  | *0.003913%  0.004050%
  rails          0.024997%  0.046829%  |  0.013134% *0.012361%
  rust           0.038579%  0.056852%  | *0.025509%  0.027068%
  tensorflow     0.013732%  0.023307%  | *0.008243%  0.008848%
  webkit         0.002212%  0.002950%  | *0.001007%  0.001065%
  -------------------------------------+------------------------
  all            0.028395%  0.110075%  | *0.006085%  0.006675%
  w/o homebrew   0.010940%  0.021286%  | *0.005968%  0.011023%

The 64 bit unsigned arithmetic does indeed fare worse in almost every
case, and significantly worse in the two homebrew repositories.

So it seems that if using some form of double hashing, then 32 bit
unsigned integer arithmetic is the way to go, even though several
programming languages lack support for unsigned types (though thanks
to the distributivity of the modulo operation, they can simply and
cheaply implement it using 64 bit arithmetic and a '& (2^32-1)' mask).

The table below compares the average false positive rate of various
hashing schemes using 32 bit unsigned integer arithmetic and the
MurmurHash3 hash function:

                              Enhanced   Enhanced   Enhanced
                               double     double     double    Improved
                               hashing    hashing    hashing    double     Double
                   7 seeds   (original)  variant 1  variant 2   hashing    hashing
  ---------------------------------------------------------------------------------
  android-base    0.004214%  *0.004155%  0.004853%  0.005193%  0.008252%  0.008539%
  cmssw          *0.003344%   0.003953%  0.003546%  0.004732%  0.004226%  0.006022%
  cpython         0.016120%   0.016607% *0.015896%  0.020259%  0.020311%  0.036840%
  elasticsearch  *0.003167%   0.003249% *0.003164%  0.003531%  0.003553%  0.004069%
  gcc             0.010281%  *0.008919%  0.010359%  0.010642%  0.012166%  0.016982%
  gecko-dev       0.000804%   0.000725% *0.000646% *0.000648%  0.000822%  0.001058%
  git            *0.063025%   0.069921%  0.067643%  0.080744%  0.091423%  0.144256%
  glibc          *0.016278%   0.016389%  0.018660%  0.021094%  0.025596%  0.026480%
  go              0.013009%  *0.012616%  0.013301%  0.014316%  0.016345%  0.021930%
  homebrew-cask  *0.007199%   0.009096%  0.009070%  0.010722%  0.016617%  0.097523%
  homebrew-core  *0.003041%   0.005360%  0.005277%  0.005148%  0.016150%  0.120860%
  jdk             0.005873%   0.006431% *0.005326%  0.005955%  0.006984%  0.006085%
  linux           0.007764%  *0.007494%  0.007714%  0.008709%  0.009429%  0.010908%
  llvm-project   *0.003367%   0.003913%  0.003730%  0.004064%  0.004809%  0.006417%
  rails          *0.012708%   0.013134%  0.013929%  0.016316%  0.014358%  0.024997%
  rust           *0.024245%   0.025509%  0.025045%  0.028204%  0.032491%  0.038579%
  tensorflow     *0.007907%   0.008243%  0.008670%  0.009913%  0.010115%  0.013732%
  webkit         *0.000999%  *0.001007%  0.001142%  0.001113%  0.001274%  0.002212%
  ---------------------------------------------------------------------------------
  all            *0.005646%   0.006085%  0.006226%  0.006928%  0.009264%  0.028395%
  w/o homebrew   *0.005888%   0.005968%  0.006152%  0.006899%  0.007807%  0.010940%

Double hashing and improved double hashing have higher average false
positive rates than enhanced double hashing; note in particular the
significantly higher false positive rate with double hashing in the
two homebrew repositories.  Enhanced double hashing is only slightly
worse than 7 different seeds, at least in this particular case (i.e.
MurmurHash3 and these specific seeds).

Comparing these hashing schemes using a different hash function
(xxHash or FNV1a) shows a similar trend; for brevity I won't include
those tables here.

The table below compares the average false positive rate of different
32 bit hash functions when used in enhanced double hashing scheme with
k = 7 and 32 bit unsigned int arithmetic, or with 7 different seeds:

                     Enhanced double hashing     |             7 seeds             |  7 uint32
                  Murmur3     xxHash     FNV1a   |  Murmur3     xxHash     FNV1a   |   SHA256
  -----------------------------------------------+---------------------------------+-----------
  android-base   0.004155%  0.005556%  0.006113% | 0.004214% *0.004101%  0.005918% | 0.004168%
  cmssw          0.003953%  0.003775%  0.005087% | 0.003344%  0.003677%  0.004353% |*0.003322%
  cpython        0.016607%  0.015919%  0.021957% | 0.016120% *0.015238%  0.023649% | 0.017632%
  elasticsearch  0.003249%  0.003589%  0.004616% | 0.003167%  0.003360%  0.003586% |*0.003027%
  gcc           *0.008919%  0.009376%  0.010555% | 0.010281%  0.009157%  0.011882% | 0.010338%
  gecko-dev      0.000725%  0.000721%  0.000932% | 0.000804% *0.000611%  0.000911% | 0.000737%
  git            0.069921%  0.063449%  0.083097% |*0.063025%  0.069137%  0.087132% | 0.063763%
  glibc          0.016389%  0.017477%  0.024321% | 0.016278% *0.016062%  0.022641% | 0.017241%
  go             0.012616%  0.012449%  0.016728% | 0.013009% *0.011692%  0.017214% | 0.012489%
  homebrew-cask  0.009096%  0.025104%  0.011073% | 0.007199%  0.007343%  0.009037% |*0.007050%
  homebrew-core  0.005360%  0.007940%  0.005450% |*0.003041%  0.003832%  0.004888% | 0.003884%
  jdk            0.006431% *0.005451%  0.007428% | 0.005873%  0.005738%  0.006654% | 0.005749%
  linux         *0.007494%  0.008266%  0.009917% | 0.007764%  0.007723%  0.009256% | 0.007939%
  llvm-project   0.003913%  0.003727%  0.004584% | 0.003367% *0.003247%  0.005115% | 0.003664%
  rails          0.013134%  0.011277%  0.014103% | 0.012708% *0.010389%  0.015928% | 0.012506%
  rust           0.025509%  0.024596%  0.032983% |*0.024245%  0.025697%  0.031593% | 0.024794%
  tensorflow     0.008243%  0.008743%  0.011499% |*0.007907%  0.008089%  0.011922% | 0.008033%
  webkit         0.001007%  0.001155%  0.001401% | 0.000999% *0.000962%  0.001291% | 0.001054%
  -----------------------------------------------+---------------------------------+----------
  all            0.006085%  0.007327%  0.007518% | 0.005646% *0.005554%  0.007591% | 0.005857%
  w/o homebrew   0.005968%  0.005979%  0.007545% | 0.005888% *0.005660%  0.007855% | 0.006039%

MurmurHash3 and xxHash are neck and neck, be it enhanced double
hashing or 7 different seeds, at least when we ignore the unusually
high false positive rate of enhanced double hashing with xxHash in the
homebrew-cask repository (it stumbles upon one of those colliding
cases discussed above).

FNV1a has a decidedly higher average false positive rate than any of
the others.

I was curious to see whether using 7 unsigned integers from SHA256
offers any benefits (being a cryptographic hash function, it should
provide some high quality hash values), but apparently it doesn't fare
any better than MurmurHash3 and xxHash.  This leads me to believe that
both MurmurHash3 and xxHash are as good as it gets, and I would not
expect that any other hash function could achieve notably lower false
positive rates.

Now let's see how these hash functions and hashing schemes fare when
writing commit-graph files with '--reachable' and with modified path
Bloom filters from scratch at the end of this patch series.  Hash
functions tend to praise themselves about how fast they can process
huge chunks of data, but we'll use them to hash many tiny strings...

                         Total runtime of writing a commit-graph file
                         with modified path Bloom filters from scratch
                                    (Time spent hashing)

                     Enhanced double hashing     |             7 seeds             |  7 uint32
                  Murmur3     xxHash     FNV1a   |  Murmur3     xxHash     FNV1a   |   SHA256
  -----------------------------------------------+---------------------------------+----------
  android-base    40.880s    40.368s    40.569s  |  41.375s    40.557s    41.843s  |  42.706s
                  (0.627s)   (0.484s)   (0.777s) |  (1.467s)   (0.886s)   (2.138s) |  (3.682s)
  cmssw           25.691s    25.224s    25.645s  |  26.715s    25.998s    27.762s  |  29.527s
                  (0.894s)   (0.650s)   (1.179s) |  (2.083s)   (1.235s)   (3.259s) |  (5.077s)
  cpython          8.951s     8.929s     9.067s  |   9.072s     9.015s     9.008s  |   9.275s
                  (0.057s)   (0.042s)   (0.045s) |  (0.112s)   (0.067s)   (0.102s) |  (0.324s)
  elasticsearch   14.470s    14.320s    14.703s  |  14.983s    14.588s    15.948s  |  16.720s
                  (0.407s)   (0.300s)   (0.622s) |  (0.991s)   (0.594s)   (1.760s) |  (2.332s)
  gcc             36.917s    36.724s    37.251s  |  37.971s    37.449s    38.178s  |  38.332s
                  (0.313s)   (0.230s)   (0.346s) |  (0.694s)   (0.418s)   (0.919s) |  (1.796s)
  gecko-dev       97.729s    96.791s    97.233s  |  99.215s    97.158s   101.403s  | 105.332s
                  (1.730s)   (1.267s)   (2.099s) |  (3.902s)   (2.344s)   (5.773s) |  (9.553s)
  git              5.245s     5.401s     5.412s  |   5.518s     5.457s     5.474s  |   5.494s
                  (0.022s)   (0.017s)   (0.018s) |  (0.045s)   (0.027s)   (0.040s) |  (0.124s)
  glibc            4.146s     4.156s     4.187s  |   4.267s     4.201s     4.278s  |   4.495s
                  (0.060s)   (0.045s)   (0.057s) |  (0.128s)   (0.079s)   (0.144s) |  (0.331s)
  go               3.565s     3.563s     3.564s  |   3.631s     3.582s     3.607s  |   3.727s
                  (0.040s)   (0.030s)   (0.035s) |  (0.084s)   (0.051s)   (0.084s) |  (0.221s)
  homebrew-cask   29.818s    29.936s    30.279s  |  30.316s    30.435s    29.942s  |  29.939s
                  (0.025s)   (0.020s)   (0.019s) |  (0.049s)   (0.032s)   (0.038s) |  (0.153s)
  homebrew-core   55.478s    55.534s    56.248s  |  56.551s    56.100s    56.445s  |  55.641s
                  (0.031s)   (0.024s)   (0.023s) |  (0.062s)   (0.038s)   (0.047s) |  (0.195s)
  jdk             19.418s    19.151s    20.246s  |  21.270s    20.037s    24.073s  |  25.916s
                  (1.260s)   (0.878s)   (2.105s) |  (3.154s)   (1.833s)   (6.133s) |  (7.732s)
  linux          100.837s   100.130s   101.244s  | 103.775s   101.645s   103.856s  | 109.365s
                  (2.027s)   (1.498s)   (2.030s) |  (4.319s)   (2.682s)   (5.316s) | (11.075s)
  llvm-project    31.188s    31.392s    31.442s  |  31.895s    31.479s    31.984s  |  32.863s
                  (0.334s)   (0.251s)   (0.345s) |  (0.724s)   (0.437s)   (0.895s) |  (1.794s)
  rails            5.607s     5.639s     5.694s  |   5.742s     5.720s     5.865s  |   6.087s
                  (0.084s)   (0.063s)   (0.095s) |  (0.189s)   (0.116s)   (0.252s) |  (0.467s)
  rust            13.250s    13.206s    13.422s  |  13.701s    13.426s    13.680s  |  13.949s
                  (0.163s)   (0.122s)   (0.169s) |  (0.359s)   (0.217s)   (0.440s) |  (0.880s)
  tensorflow      11.808s    11.608s    11.915s  |  12.379s    11.966s    12.860s  |  13.588s
                  (0.368s)   (0.267s)   (0.497s) |  (0.860s)   (0.517s)   (1.369s) |  (2.081s)
  webkit          30.469s    30.735s    30.945s  |  32.005s    31.004s    32.401s  |  33.303s
                  (0.501s)   (0.376s)   (0.733s) |  (1.212s)   (0.725s)   (2.084s) |  (3.044s)

So xxHash is indeed the fastest, even in our use case, and MurmurHash3
comes in second.  The time spent hashing with 7 seeds tends to be
around twice as much as with enhanced double hashing when using the
same hash function.  However, the time spent hashing is only a
fraction of the total runtime, and while its effect on total runtime
is measurable both in best-of-five and average, it tends to be smaller
than run-to-run noise.

As for availability of hash functions:

  - MurmurHash3 is widely available; both the reference implementation
    and a streaming-capable ANSI C implementation is in the public
    domain.

  - xxHash is allegedly available in a variety of programming
    languages, as shown on www.xxhash.com (supposedly, but I haven't
    been able to load that page since months... some cloudflare host
    error persists)

  - SHA256 is widely available, and it must be part of every Git
    implementation in the near future anyway, but it's slower than the
    others, and, more importantly, it doesn't scale for k > 8.

  - FNV1a is so simple that anyone can implement a variant that
    incrementally computes two hashes up to the next directory
    separator in one go in about 20 lines of code (though note that
    the above benchmarks didn't use such an implementation).  Alas,
    because of its higher false positive rate it's out anyway.

Conclusion: we should seriously consider using MurmurHash3 (or xxHash)
and hashing each path k times with k different seeds instead of any
double hashing scheme.

Merges
------

It's not clear whether it's worth computing, storing and using
modified path Bloom filters for all parents of merge commits.

  - The number of paths modified between a merge commit and its
    second..nth parents is in general considerably larger than between
    any commit and its first parent.  E.g. a lot of files are modified
    in Git's master branch while a topic cooks for a few weeks, and
    much more in Linux when a branch started from the previous release
    is merged near the end of the next merge window.

                                 Average number of
                                   modified paths
                    Percentage      compared to:
                     of merge     first    second
                      commits     parent   parent
      --------------------------------------------
      android-base     73.6%       14.1    1553.6
      cmssw            11.0%       16.1     977.4
      cpython          11.7%        5.1     933.7
      elasticsearch     8.4%       40.1     246.5
      gecko-dev         3.5%       23.5     602.0
      git              25.3%        4.0     394.8
      homebrew-cask     9.6%        2.7      42.8
      jdk              25.0%      184.1     408.2
      linux             7.4%       26.1    2268.0
      rails            22.2%        9.6     101.0
      rust             27.0%       15.7     397.3
      tensorflow        9.1%       39.2    1057.4

    Consequently:

      - The tree-diff machinery has to work that much more to gather
        modified paths for all parents of merge commits, significantly
        increasing the runtime of writing the commit-graph file.

      - Storing modified path Bloom filters for all parents of merge
        commits significantly increases the size of the Modified Path
        Bloom Filters chunk, though depending on the percentage of
        merge commits and on the size of the other chunks the relative
        size increase of the whole commit-graph file might not be all
        that much.

      - [TODO: A few old test results suggest that pathspec-limited
        revision walks with default history simplification using a
        commit-graph file storing modified path Bloom filters for all
        merge parents are a few percents slower than when storing
        Bloom filters only for first parents.  Even fewer old test
        results suggest that writing all Bloom filters for first
        parents first, and then all for second..nth parents might
        eliminate much of that runtime difference.
        Definitely need more benchmarking.]

  - During a pathspec-limited revision walk Git's default history
    simplification only checks whether the given path was modified
    between a merge commit and its second..nth parents when the path
    was modified between that merge commit and its first parent.  This
    usually happens rarely, though these second..nth parent diffs tend
    to be more expensive than first parent diffs (because of the
    considerably more modified paths the tree-diff machinery can't
    short-circuit that early).  Anyway, the potential speedup is low.

  - However, with '--full-history', i.e. without any history
    simplification, all merge commits are compared to all their
    parents, and typical additional speedups are in the range of
    2x-3x, while in some cases over 7x or 11x can be achieved by using
    modified path Bloom filters for all parents of merge commits.

Does it worth it?  For me personally it doesn't, but I don't know how
often others use '--full-history' and what trade-offs they might be
willing to make.

So the file format described here adds _optional_ support for storing
modified path Bloom filters for all parents of merge commits, and the
users can make this decision themselves.

[TODO: describe it!]

Deduplication
-------------

Some commits have identical modified path Bloom filters, because they
modify the same set of paths (or because they modify different set of
paths but happen to end up setting the same bit positions in the Bloom
filter).  By omitting duplicates from the Modified Path Bloom Filters
chunk its size can be reduced typically by around 5-20%, and in case
of the android-base repository by over 69%.

Explicitly allow that multiple entries of the Modified Path Bloom
Filter Index chunk can refer to the same offset in the Modified Path
Bloom Filters chunk.  This is important, because even if an
implementation doesn't perform this deduplication while writing the
commit-graph file, it must be prepared for multiple index entries
referring to the same offset in commit-graph file written by a
different implementation.

Modified Path Bloom Filter Excludes
-----------------------------------

Some repositories contain leading directories that are modified in the
great majority of commits, e.g. the homebrew-core repository's
'Formula' directory is modified in over 99.7% of commits, while
homebrew-cask's 'Casks' and rust's 'src' are modified in over 93% of
commits.  And there is 'src/main/java/com/company/division/project' in
convention-following Java projects...

Modified path Bloom filters can't speed up revision walks when the
pathspec is such a frequently modified leading directory, because
because of potential false positives we'll have to run tree-diff for
the majority of commits anyway.  And it doesn't really make sense to
query the history of such a leading directory in practice, because it
will list the majority of commits, so one might as well look straight
at the output of a pathspec-less 'git log'.

However, adding those frequently modified leading directories to the
modified path Bloom filters requires more space and increases the
probability of false positives.

So the file format described here adds support for excluding specific
paths from modified path Bloom filters by listing them in the Modified
Path Bloom Filter Excludes chunk.

[TODO: Figure out the details!]

Limitations
-----------

Because of the possibility of false positives, if a modified path
Bloom filter query returns 'possibly in set', then we have to invoke
tree-diff to make sure that the path in question was indeed modified
by the given commit.  Consequently, Bloom filters can't improve
performance that much when looking for the history of a frequently
modified path, because a lot of tree-diff invocations can't be
eliminated.  In the extreme case when looking for the history of a
path modified in every commit, then using Bloom filters will only add
extra overhead.

A modified path Bloom filter doesn't store the names of modified
paths, it only sets a couple of bits based on those paths' hashes.
Consequently, it can only be used when looking for the history of a
concrete path, and:

  - It can't be used with a pathspec containing wildcards like 'git
    log -- "*.h"'.

    However, it could still be used when the pathspec contains leading
    directories without wildcards, e.g. 'git log --
    "arch/x86/include/*.h", by limiting tree-diff only to commits
    modifying the 'arch/x86/include/' directory.

  - It can't tell which paths were modified by a given commit; for
    that we would still have to run tree-diff for the full tree.

Submodules [TODO]
-----------------

No modified path Bloom filters should be stored for commits modifying
submodules.

This is questionable, but is necessary to produce the same output with
and without modified path Bloom filters.  If 'dir/submod' is a gitlink
file, then currently 'git rev-list HEAD -- dir/submod/whatever' lists
all commits touching 'dir/submod', even when that 'whatever' has never
existed.  And that 'whatever' can be basically anything, so we will
not find them in any of our modified path Bloom filters, therefore in
a Bloom-filter-assisted revision walk we won't list any commits.

The only way around this is to not not write any modified path Bloom
filters for commits modifying submodules.

Note, however, that once upon a time that command wouldn't list
anything, either, but the behavior changed with commit 74b4f7f277
(tree-walk.c: ignore trailing slash on submodule in
tree_entry_interesting(), 2014-01-23) to what we have now.  As
74b4f7f277's log message only talks about handling 'dir/submod/' and
'dir/submod' (i.e. with and without trailing slash) consistently, I
suspect that this behavior change with 'dir/submod/anything' is an
unintended and undesired side effect.  However, as it involves
submodules I would rather not have an opinion :)

In any case, someone with more clues about submodules should take a
closer look and decide whether this is a bug or not, before this
modified path Bloom filter thing goes much further.  If it is a bug
indeed, then it should be fixed and the remark about submodules should
be removed from the modified path Bloom filter specs.  If the current
behavior is desired, then the remark about submodules should be
updated to use proper English (IMO it must be part of the spec,
because this is a subtle issue that developers of other
implementations (JGit, libgit2, etc.) might easily overlook).

Threats to validity
-------------------

  - Random paths are... random.  Picking random paths can
    over-represent rarely modified files.  Since modified path Bloom
    filters bring more benefits to rarely modified paths, the reported
    speedups later in the series might be higher than what the users
    will usually see.  (I suppose that users more often check the logs
    of frequently modified files than of rarely modified ones.)

    Though some of these random paths made me stumble upon the issue
    with submodules discussed above, so...

  - Bugs :)  It's not that hard to make subtle bugs that don't affect
    correctness, because the probabilistic nature of Bloom filters
    cover them up.  However, bugs like incorrectly calculating the
    size of a Bloom filter or having an off-by-one error in the
    filter's array handling affect the false positive rate and in turn
    the runtime of pathspec-limited revision walks.

Alternatives considered
-----------------------

Here are some alternatives that I've considered but discarded and
ideas that I haven't (yet) followed through:

  - One Bloom filter to rule them all?  No.

    While the first proof of concept implementation [2] demonstrated
    that by combining hashes of modified pathnames with commit and
    parent object ids it is possible to use a single big Bloom filter
    and achieve notable speedups, that approach has several drawbacks:

      - Poor locality: The bit positions that are set for any element
        are supposed to be random, so we need to access random parts
        of the big Bloom filter array for each checked commit and
        path, which is bad for the caches.  Furthermore, we would
        likely need to read the whole big Bloom filter array even when
        looking for only a handful of commits, e.g. for 'git log -3 --
        dir/file'.

      - Need to grow: As the repository grows, more and more paths
        will be added to the modified path Bloom filter, thus lowering
        its false positive rate.  Eventually we'll reach a point where
        we would need to increase the size of the Bloom filter, but a
        Bloom filter can't be resized, so we'll have to create a
        larger Bloom filter and re-fill it from scratch.  This is a
        considerably more expensive operation than creating a
        modified-path-Bloom-filter-less commit-graph from scratch.

      - Expired commits: Unreachable commits are eventually pruned by
        'git gc', but elements can't be removed from a Bloom filter.
        Consequently, all the bits set for paths modified by expired
        commits will remain set, slightly increasing the probability
        of false positives until the Bloom filter is eventually
        resized/rewritten.  There are enhanced Bloom filter variants
        that allow removal of elements, like a counting Bloom filter,
        but they come with additional complexity and require
        considerably more space.

      - Excluded commits: Sometimes we can't or just don't want to
        record the paths modified by a particular commit, so we need a
        way to record which commits don't have entries in the big
        Bloom filter.

      - There are plans for split commit-graph files, because
        rewriting the whole commit-graph file in big repositories to
        add only relatively few new commits seems to impose
        considerable overhead.  We would need a big modified path
        Bloom filter in each of the split commit-graph files that
        we'll combine together when the split commit-graph files are
        combined.  Bloom filters of the same size can be quickly
        combined by bitwise OR-ing their arrays, but this implies that
        the split commit-graph files holding only a few commits would
        have to contain as large a Bloom filter as is used in the
        shared commit-graph file holding most of the commits in the
        repository, wasting valuable storage space.  (Though the false
        positive rate for commits stored in the split commit-graph
        should be spectacularly low).

    Using a small Bloom filter for each commit avoids these issues,
    though, as discussed earlier, not without introducing issues of
    its own.

  - Xor filters "are a faster and more concise alternative to Bloom
    filters", but they turned out to be too large in our use case.

    Each Xor filter consists of a header and an array: the header
    contains a 64 bit seed and a 64 bit blocklength, while the array
    contains 3 * blocklength bytes.  Consequently, the space advantage
    of Xor filters compared to Bloom filters is only true for large
    filters, where the header's size is negligible compared to the
    array's size.  We, however, have a lot of very small filters, and
    all those extra headers cause an unacceptable size increase
    (partly because the Xor filter's header is 4 times the size of the
    header of our modified path Bloom filters, and partly because
    their headers make Xor filters too large to be embedded in the
    index chunk).

                     MPBF chunk      Total
                        size       Xor Filter
                     (no dedup)       size
      ------------------------------------------------
      android-base     8620198      28840220     +334%
      cmssw           10347018      21538283     +208%
      cpython           526381       6393731    +1214%
      elasticsearch    4733274       9101088     +192%
      gcc              3724544      14637474     +393%
      gecko-dev       20591010      56887518     +276%
      git               160718       3115827    +1938%
      glibc             759132       3162898     +416%
      go                458375       2676306     +584%
      homebrew-cask      94383       5272317    +5586%
      homebrew-core      27823       8051585   +28938%
      jdk             13213208      15574498     +117%
      linux           26575278      69577374     +261%
      llvm-project     3988133      20235803     +507%
      rails             958691       5172643     +539%
      rust             1985782       7082949     +356%
      tensorflow       4173570       8333062     +200%
      webkit           5891751      15927504     +270%

    https://github.com/FastFilter/xor_singleheader

  - fastrange is "a fast alternative to the modulo reduction", but it
    turned out to significantly increase the average false positive
    rate when using (enhanced) double hashing.

    The traditional way to fairly map an integer value to the range
    [0,m) is the modulo reduction, i.e. 'h % m'.  Alas, integer
    division is a costly instruction, and we have to do it a lot while
    checking modified path Bloom filters for every commit.  Fastrange
    proposes to use the combination of an integer multiplication and a
    shift as: '((uint64_t)h * (uint64_t)m) >> 32', as it is supposed
    to be faster while being just as fair.

    In my benchmarks fastrange did indeed reduce the time spent
    checking modified path Bloom filters by up to ~10% (no table about
    this), though that's usually only a fraction of the total runtime
    of a pathspec-limited revision walk.  Unfortunately, as the table
    below shows, it also increased the average false positive rate by
    about 5x when using enhanced double hashing.  I haven't yet
    investigated why this happens.  However, fastrange is a viable
    option when using the same hash function with 'k' different seeds,
    as in that case it's on par with the good old % operator, and in
    fact its false positive rate happens to be slightly lower.

                           Enhanced        |
                        double hashing     |        7 seeds
                                           |
                    % operator  fastrange  | % operator  fastrange
      -------------------------------------+----------------------
      android-base  *0.004155%  0.013987%  |  0.004214%  0.004873%
      cmssw          0.003953%  0.006129%  | *0.003344%  0.005513%
      cpython        0.016607%  0.033979%  | *0.016120%  0.018069%
      elasticsearch  0.003249% *0.003149%  |  0.003167%  0.003333%
      gcc            0.008919%  0.013098%  |  0.010281% *0.007324%
      gecko-dev      0.000725%  0.000923%  |  0.000804% *0.000631%
      git            0.069921%  0.106424%  | *0.063025%  0.065347%
      glibc          0.016389%  0.025323%  | *0.016278%  0.016796%
      go            *0.012616%  0.020914%  |  0.013009%  0.012915%
      homebrew-cask  0.009096%  0.182016%  |  0.007199% *0.006937%
      homebrew-core  0.005360%  0.092957%  | *0.003041%  0.004251%
      jdk            0.006431%  0.005732%  |  0.005873% *0.005673%
      linux          0.007494%  0.013862%  |  0.007764% *0.007127%
      llvm-project   0.003913%  0.004829%  |  0.003367% *0.002943%
      rails          0.013134%  0.019524%  | *0.012708%  0.012979%
      rust           0.025509%  0.029824%  | *0.024245%  0.025174%
      tensorflow     0.008243%  0.012485%  | *0.007907%  0.008120%
      webkit        *0.001007%  0.001044%  | *0.000999%  0.001220%
      -------------------------------------+----------------------
      all            0.006085%  0.029034%  |  0.005646% *0.005579%
      w/o homebrew   0.005968%  0.009478%  |  0.005888% *0.005662%

    https://github.com/lemire/fastrange

  - fastmod "is a header file for fast 32-bit division remainders on
    64-bit hardware", promising faster computation when the same
    divisor is (re)used for multiple remainder calculations.  This is
    exactly what we are doing when checking a modified path Bloom
    filter, as we have to map a bunch of unsigned 32 bit hash values
    to bit positions with pos = hash % nr_bits, using the same nr_bits
    for all hash values.

    However, benchmarking has not shown conclusive improvements: while
    fastmod slightly reduces the time spent checking modified path
    Bloom filters in some repositories, it slightly increases in
    others, and any effect on the runtime of the whole
    pathspec-limited revision walk is below the noise.  Perhaps we
    have too many "embedded" modified path Bloom filters that we can
    check with a mask.  Or perhaps my CPU is too old and its integer
    multiplication instruction is not that much faster compared to
    integer division.

    https://github.com/lemire/fastmod

  - Varints.  Using 4 bytes for the size of each Bloom filter in the
    Modified Path Bloom Filters chunk is a lot more than necessary.
    How much space can be saved by using varints?  How much is the
    runtime overhead of decoding those varints?  How can we ensure
    that varint decoding doesn't read past the end of the mmapped
    memory region?

  - Path-depth-dependent number of hash functions.
    As shown later in this patch series, we can drastically reduce the
    average false positive rate by checking the presence of the
    pathspec's leading directories as well.  With that change the
    false positive rate will depend on the depth of the path in the
    directory tree as well: the deeper the path (i.e. the more leading
    directories can be checked), the lower the false positive rate,
    sometimes maybe even unnecessarily low (0% is not an
    approximation, but there was not a single false positive, though
    the number of checked paths at those depths is quite low):

                     linux.git                  webkit.git

               Number of                    Number of
      Path      checked   Average false      checked   Average false
      depth    paths at   positive rate     paths at   positive rate
      depth   that depth  at that depth    that depth  at that depth
      --------------------------------------------------------------
        1           2      0.31974050%          0       n/a
        2          97      0.06539501%         16       0.04927891%
        3        1052      0.01782569%        176       0.00472128%
        4        1674      0.00396338%        586       0.00399830%
        5        1315      0.00350913%        775       0.00089653%
        6         548      0.00149110%       1396       0.00025000%
        7         113      0.00057555%       1004       0.00004777%
        8         124      0.00085726%        397       0.00001380%
        9          34      0.00025182%        311       0.00001468%
       10          14      0%                 250       0.00000182%
       11           8      0%                  45       0%
       12          19      0%                  39       0%
       13           0      n/a                  3       0%
       14           0      n/a                  1       0%
       15           0      n/a                  1       0%
      --------------------------------------------------------------
      all        5000      0.007607%         5000       0.001012%

    So we could use fewer hashes when adding deep paths to modified
    path Bloom filters, saving same space while still achieving quite
    low false positive rate.  Or when aiming for consistently lower
    false positive rate, then we could use more hashes only for
    shallower paths.

[1] The repositories used in the above benchmarks are:

    android-base:
      https://android.googlesource.com/platform/frameworks/base
    cmssw:
      https://github.com/cms-sw/cmssw
    cpython:
      https://github.com/python/cpython
    elasticsearch:
      https://github.com/elastic/elasticsearch
    gcc:
      git://gcc.gnu.org/git/gcc.git
    gecko-dev:
      https://github.com/mozilla/gecko-dev
    git:
      https://git.kernel.org/pub/scm/git/git.git
    glibc:
      git://sourceware.org/git/glibc.git
    go:
      https://github.com/golang/go
    homebrew-cask:
      https://github.com/homebrew/homebrew-cask
    homebrew-core:
      https://github.com/homebrew/homebrew-core
    jdk:
      https://github.com/openjdk/jdk
    linux:
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    llvm-project:
      https://github.com/llvm/llvm-project.git
    rails:
      https://github.com/rails/rails
    rust:
      https://github.com/rust-lang/rust
    tensorflow:
      https://github.com/tensorflow/tensorflow
    webkit:
      git://git.webkit.org/WebKit.git

[2] First PoC Bloom filter experiment:
    https://public-inbox.org/git/20181009193445.21908-1-szeder.dev@gmail.com/

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 .../technical/commit-graph-format.txt         | 125 ++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index efd0f99d8b..6b041f94ee 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -99,6 +99,131 @@ CHUNK DATA:
       file's OID Lookup chunk is equal to i plus the number of commits in all
       base graphs.  If B is non-zero, this chunk must exist.
 
+  Modified Path Bloom Filter Index (ID: {'M', 'P', 'B', 'I'}) [Optional]
+    2 + (N * 8) bytes
+
+    * 1-byte number (K) of hashes per path
+
+    * An array of 8 byte entries, one for each N commits stored in the
+      commit-graph, with the ith entry associated to the ith commit in the
+      OID Lookup chunk.
+      Each entry should be interpreted as follows:
+
+      - If all bits in the 8 bytes are set, then there is no modified path
+	Bloom filter stored for this commit.
+
+	Thou shalt not store a modified path Bloom filter for a commit that
+	modifies a submodule (gitlink file)!
+
+      - If the most significant bit of the first byte is set, then the
+	remaining 63 bits represent the bit array of an "embedded" Bloom
+	filter containing the set of paths that were modified between the
+	associated commit and its first parent, or in case of a root commit
+	between the associated commit and the empty tree.  All embedded
+	modified path Bloom filters use the same hashing scheme that is
+	used in the Modified Path Bloom Filters chunk, see below.
+
+      - If the second most significant bit of the first byte is set, then
+	the last four bytes form an array position into the Modified Path
+	Bloom Filter Merge Index chunk.  The remaining bits in the first
+	four bytes should be set to 0.  This can optionally be used to
+	store Bloom filters for all parents of merge commits.
+	[TODO: Only the last four bytes?  Shouldn't that be last 62 bits?!]
+
+      - If the two most significant bits of the first byte are unset, then
+	the entry represents an 8 byte offset pointing to a Bloom filter
+	in the Modified Path Bloom Filters chunk, which contains the set
+	of paths that were modified between the associated commit and its
+	first parent, or in case of a root commit between the associated
+	commit and the empty tree.  This offset is relative to the start
+	of the Modified Path Bloom Filters chunk.  Multiple entries can
+	point to the same offset.
+
+  Modified Path Bloom Filter Merge Index (ID: {'M', 'P', 'B', 'M'}) [Optional]
+      An array of 8 byte entries.
+      If a merge commit's entry in the Modified Path Bloom Filter Index
+      chunk stores an array position into this chunk, then the entry at
+      that position is associated with that merge commit and its first
+      parent, the next entry with that merge commit and its second parent,
+      etc. for all parents of that merge commit.
+
+      Each entry should be interpreted as follows, similar to the entries
+      in the Modified Path Bloom Filter Index chunk:
+
+      - If all bits in the 8 bytes are set, then there is no modified path
+	Bloom filter stored for the associated merge commit and its
+	corresponding parent.
+
+      - If the MSB of the first byte is set, then the remaining 63 bits
+	represent the bit array of an "embedded" Bloom filter containing
+	the set of paths that were modified between the associated merge
+	commit and its corresponding parent.  All embedded modified path
+	Bloom filters use the same hashing scheme that is used in the
+	Modified Path Bloom Filters chunk, see below.
+
+      - If the most significant bit of the first byte is unset, then the
+	entry represents an 8 byte offset pointing to a Bloom filter in
+	the Modified Path Bloom Filters chunk, which contains the set of
+	paths that were modified between the associated merge commit and
+	its corresponding parent.  This offset is relative to the start of
+	the Modified Path Bloom Filters chunk.  Multiple entries can point
+	to the same offset.
+
+      This chunk should not exist if the commit-graph file doesn't contain
+      a Modified Path Bloom Filter Index chunk.  This chunk can be omitted
+      if the Modified Path Bloom Filter Index doesn't contain any array
+      indexes pointing into it.  This chunk can be omitted even when the
+      commit-graph contains merge commits.
+
+  Modified Path Bloom Filters (ID: {'M', 'P', 'B', 'F'}) [Optional]
+      A number of consecuive modified path Bloom filters of varying sizes.
+      Each Bloom filter consists of 4 + M bytes:
+
+      - The first 4 bytes specify the number of m bits in the Bloom
+	filter's bit array.
+
+      - The next M bytes hold the Bloom filter's bit array of m bits.
+
+      The bits in the array are indexed in network byte order, i.e. the
+      array's 0th bit is the least significant bit of the last byte, and
+      the (m-1)th bit is in the first byte.  If m is not a multiple of 8,
+      then the unused leading bits should be set to 0.
+
+      For each path (including leading directories) modified between a
+      commit and its parent K bit positions should be set using the
+      following hashing scheme:
+
+	for i in [0, K)
+	  bit_pos[i] = (hash1 + i * hash2 + (i * i * i - i) / 6) % m
+
+      where hash1 and hash2 are the 32 bit MurmurHash3 hashes of the path
+      with seeds 0xe83c5163 and 0x3b376b0c, respectively.  These bit
+      positions should be calculated using 32 bit unsigned integer
+      arithmetic.  The directory separator is a single '/'.  Directories
+      should be added without a trailing '/'.
+
+      The order of Bloom filters in this chunk is unspecified.  Multiple
+      entries in the Modified Path Bloom Filter Index or Modified Path
+      Bloom Filter Merge Index chunks can point to the same Bloom filter
+      in this chunk.
+
+      This chunk should not exist if the commit-graph doesn't contain a
+      Modified Path Bloom Filter Index chunk.  This chunk can be omitted
+      if neither the Modified Path Bloom Filter Index nor the Modified
+      Path Bloom Filter Merge Index chunks contain any offsets pointing
+      into it.
+
+  Modified Path Bloom Filter Excludes (ID: {'M', 'P', 'B', 'X'}) [Optional]
+      A number of consecutive null terminated strings in memcmp() order.
+      Paths in these strings should not be added to any modified path
+      Bloom filters.
+      [TODO: Clarify!  Only literal matches, or should we support
+      globbing/patterns as well?  Only full match, or leading directories
+      as well?]
+
+      This chunk should not exist if the commit-graph doesn't contain a
+      Modified Path Bloom Filter Index chunk.
+
 TRAILER:
 
 	H-byte HASH-checksum of all of the above.

From patchwork Fri May 29 08:50:20 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578137
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5C832739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:26 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 4116520723
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:26 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="YvBAVGLs"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726787AbgE2IvY (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:24 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44392 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726911AbgE2IvX (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:23 -0400
Received: from mail-wm1-x342.google.com (mail-wm1-x342.google.com
 [IPv6:2a00:1450:4864:20::342])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CCAC4C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:22 -0700 (PDT)
Received: by mail-wm1-x342.google.com with SMTP id r9so2317009wmh.2
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:22 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=r0KsrOctfKXYg5G9FZKB2N4Z6/V+Ve9RekzoVrZUBgk=;
        b=YvBAVGLsAgBJcCvb0c3IlVTrxVnlU3Nl3vmxit9ssTvaRgmPmLj2GQrsC0UVn8PGhj
         8YU1Pp/lrG/VFcOsM5FJRbKGMHaL4sk3QXmnpJ0JD/LZS2pOS6nJ0niYsB1olqKuxI1o
         zEwR9E+bVe0MHsbE31XMx+PDDK5YpBmXT2F3BOpjc1h5LI6fDveEZ5xr0v5nR2ONm1jR
         xBFe4e7bcaIDTrCR9mgMPILsvKthTOg/vi8ro/sEacOyw9+sg70DJvmJSqkjN8bHYiIK
         hxHdRHrAmfHF4YK0RS0FDkxook3IT4w1OulUWyAzuLstTIX9/uIMKAuIC1Vaq4fjswor
         TtIg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=r0KsrOctfKXYg5G9FZKB2N4Z6/V+Ve9RekzoVrZUBgk=;
        b=D4iPJd3+U6C42L/pAMueOuytRsM0WsmqxG/9Eb9Qt9OCLk/ga3JksI5KM1zAg2G0Uo
         DQStxFMmIhy2QofrUPRNq0Af1X0OhqjfmBq714Iq4BPbPOquZR5ZCSC/Uk9GS0DTYWmr
         nBw4jnGBRfmttMZGRkomSQWygO9UIDNMn8S5a4XGQJhStq/CLxuhuOQPyVZk/OBQ2+Ix
         b4cCuOhE/efQ9eJFs3NbrNXTpeg73BLlOQ52u6O4nBBaKjjYUBdV5DzV+lV1i+LaGkJ/
         PdSpOS6YdBclcDjogwhnCOhBg3cXL0SkNdNumiqOBMeStXRCMGXcA6seZsnUfLqvxk26
         J/aA==
X-Gm-Message-State: AOAM532CpD1QQ9s4pr9hlVVsB2Jli2NJEKyaIgmZDyeAKzDGW4ARn/O8
        BF4TXfxL1KyrhSHUqKANGsqOEgnu
X-Google-Smtp-Source: 
 ABdhPJzZPKmqV+e+vCKdHzziRWkOcwc9bNvI4HrSCyo7E+ylEL9F4vVhgHSOz8z+urjvY4hlv3lwWQ==
X-Received: by 2002:a1c:a5ce:: with SMTP id o197mr221696wme.85.1590742281110;
        Fri, 29 May 2020 01:51:21 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.20
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:20 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 16/34] Add a generic and minimal Bloom filter implementation
Date: Fri, 29 May 2020 10:50:20 +0200
Message-Id: <20200529085038.26008-17-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

We'll use Bloom filters to store the paths modified between each
commit and one of its parents.  This means many small Bloom filters,
which puts some constraints on the hashing scheme, as discussed in the
previous commit.  However, Bloom filters might turn out to be useful
in other parts of Git as well, where they might be free from those
constraints.  So add a Bloom filter implementation that is fairly
generic, and copes only with allocation, initialization and release of
the filter and with manipulating the bits in its bit array, but leaves
all the hashing to the user of the filter.

Unfortunately, it's still somewhat specific, though, in how it maps
the hashes to specific bits in memory, i.e. how it maps the hashes to
array indices and how the bit array is indexed.  We'll see how that
will pan out...
---
 Makefile       |  1 +
 bloom-filter.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 bloom-filter.h | 47 ++++++++++++++++++++++++++
 3 files changed, 139 insertions(+)
 create mode 100644 bloom-filter.c
 create mode 100644 bloom-filter.h

diff --git a/Makefile b/Makefile
index 09f98b777c..a9db1e8d9b 100644
--- a/Makefile
+++ b/Makefile
@@ -839,6 +839,7 @@ LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
 LIB_OBJS += blame.o
 LIB_OBJS += blob.o
+LIB_OBJS += bloom-filter.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
 LIB_OBJS += bundle.o
diff --git a/bloom-filter.c b/bloom-filter.c
new file mode 100644
index 0000000000..919da26462
--- /dev/null
+++ b/bloom-filter.c
@@ -0,0 +1,91 @@
+#include "bloom-filter.h"
+
+void bloom_filter_init(struct bloom_filter *bf, unsigned int nr_hashes,
+		       uint32_t nr_elements)
+{
+	/* n * k / ln(2) ≈ n * k / 0.69315 ≈ n * k * 10 / 7 */
+	uint32_t nr_bits = st_mult(st_mult(nr_elements, nr_hashes), 10) / 7;
+	uint32_t nr_bytes = st_add(nr_bits, 7) / 8;
+	/*
+	 * But we round up to fully utilize all bytes, thus lowering the
+	 * probability of false positives a bit.
+	 */
+	bf->nr_bits = nr_bytes * 8;
+	bf->bits = xcalloc(nr_bytes, sizeof(*(bf->bits)));
+}
+
+void bloom_filter_init_with_size(struct bloom_filter *bf, uint32_t nr_bits)
+{
+	uint32_t nr_bytes = st_add(nr_bits, 7) / 8;
+	bf->nr_bits = nr_bits;
+	bf->bits = xcalloc(nr_bytes, sizeof(*(bf->bits)));
+}
+
+void bloom_filter_free(struct bloom_filter *bf)
+{
+	FREE_AND_NULL(bf->bits);
+	bf->nr_bits = 0;
+}
+
+uint32_t bloom_filter_bytes(struct bloom_filter *bf)
+{
+	return (bf->nr_bits + 7) / 8;
+}
+
+void bloom_filter_clear_all_bits(struct bloom_filter *bf)
+{
+	memset(bf->bits, 0, bloom_filter_bytes(bf));
+}
+
+void bloom_filter_set_all_bits(struct bloom_filter *bf)
+{
+	memset(bf->bits, 0xff, bloom_filter_bytes(bf));
+}
+
+static inline uint32_t bit_offset(uint32_t nr_bits, uint32_t hash)
+{
+	return hash % nr_bits;
+}
+
+static inline uint32_t byte_offset(uint32_t nr_bits, uint32_t bit_offset)
+{
+	return (nr_bits - 1) / 8 - bit_offset / 8;
+}
+
+static inline uint8_t byte_mask(uint32_t bit_offset)
+{
+	return 1 << (bit_offset % 8);
+}
+
+static inline void bloom_filter_set_one_bit(struct bloom_filter *bf,
+					    uint32_t hash)
+{
+	uint32_t offset = bit_offset(bf->nr_bits, hash);
+	bf->bits[byte_offset(bf->nr_bits, offset)] |= byte_mask(offset);
+}
+
+void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *hashes,
+			   unsigned int nr_hashes)
+{
+	unsigned int i;
+	for (i = 0; i < nr_hashes; i++)
+		bloom_filter_set_one_bit(bf, hashes[i]);
+}
+
+static inline int bloom_filter_check_one_bit(struct bloom_filter *bf,
+					     uint32_t hash)
+{
+	uint32_t offset = bit_offset(bf->nr_bits, hash);
+	return bf->bits[byte_offset(bf->nr_bits, offset)] & byte_mask(offset);
+}
+
+enum bloom_result bloom_filter_check_bits(struct bloom_filter *bf,
+					  const uint32_t *hashes,
+					  unsigned int nr_hashes)
+{
+	unsigned int i;
+	for (i = 0; i < nr_hashes; i++)
+		if (!bloom_filter_check_one_bit(bf, hashes[i]))
+			return BLOOM_DEFINITELY_NOT;
+	return BLOOM_POSSIBLY_YES;
+}
diff --git a/bloom-filter.h b/bloom-filter.h
new file mode 100644
index 0000000000..2fcd4a2c19
--- /dev/null
+++ b/bloom-filter.h
@@ -0,0 +1,47 @@
+#ifndef BLOOM_FILTER_H
+#define BLOOM_FILTER_H
+
+#include "git-compat-util.h"
+
+enum bloom_result {
+	/*
+	 * BLOOM_CANT_TELL is a value that a caller can use to report that
+	 * a Bloom filter is not available; bloom_filter_check_bits() will
+	 * never return it.
+	 */
+	BLOOM_CANT_TELL = -1,
+	BLOOM_DEFINITELY_NOT = 0,
+	BLOOM_POSSIBLY_YES = 1
+};
+
+struct bloom_filter {
+	uint32_t nr_bits;
+	uint8_t *bits;
+};
+
+/*
+ * Initialize a Bloom filter with the number of bits that is (close to)
+ * optimal to hold the given number of elements using the given number
+ * of hashes per element.
+ */
+void bloom_filter_init(struct bloom_filter *bf, uint32_t nr_hashes,
+		       uint32_t nr_elements);
+
+/* Initialize a Bloom filter with the given number of bits */
+void bloom_filter_init_with_size(struct bloom_filter *bf, uint32_t nr_bits);
+
+void bloom_filter_free(struct bloom_filter *bf);
+
+/* Return the size of the Bloom filter's bit array in bytes */
+uint32_t bloom_filter_bytes(struct bloom_filter *bf);
+
+void bloom_filter_clear_all_bits(struct bloom_filter *bf);
+void bloom_filter_set_all_bits(struct bloom_filter *bf);
+
+void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *hashes,
+			   unsigned int nr_hashes);
+enum bloom_result bloom_filter_check_bits(struct bloom_filter *bf,
+					  const uint32_t *hashes,
+					  unsigned int nr_hashes);
+
+#endif

From patchwork Fri May 29 08:50:21 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578139
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 64426739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:29 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 47AC5207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:29 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="O2w1LEG/"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726926AbgE2Iv2 (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:28 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44400 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726911AbgE2IvZ (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:25 -0400
Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com
 [IPv6:2a00:1450:4864:20::441])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0B980C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:25 -0700 (PDT)
Received: by mail-wr1-x441.google.com with SMTP id l10so2515873wrr.10
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=Yz35yalIF2hgDd3mYWS8UBFeKb045/+YsLCXFT0JM9Y=;
        b=O2w1LEG/KmcH+Hn58754StZVV4TfWgu5mqYCX5+huHcFKAgpZz1M9dsnyvPKPMl0zd
         el9WEL7/AzLVE5z4HcUMMt54klz1Ad2CJYqD/KSleileTPGaNuF+1s2GaCh8WVo5DCHv
         7r1qhQkD9wGWaRkvNLNGef76lPp1KKwSOknGtCNFJ/Q9R46Na+ftcJdF99k/P3ZGjCy2
         KVM1I5BUWIKIB40/YuPXQaHDZeQJgrmrXoP/J8nGM9DDdWi7bjD21eKgPWnV+Gj1ibAA
         FzVvn7U6OqcuAtmC0ll4KtsqytjB4yceGSn5+vRCqgnIHa/rhyg19lB0oOXLmYxuS6fL
         4FHQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=Yz35yalIF2hgDd3mYWS8UBFeKb045/+YsLCXFT0JM9Y=;
        b=lgjOQU5I0cvmlmMr4RQH0beqcUr65/W/OoB56DF3OOurlEku9z8MQpE90DvFH8sM13
         m1ThTwpmM3JbllEUC6XBFLLai7k9C/Z4/02kHj8xfQYQPEA1rZ96ejKG0kMvlMCBfoiG
         wZAMnZNy77QslEJz/UHtCU5pU82hTIuOzfS0+rx7v6q1bY0/3TXcm3W7jZyhuHUEAKso
         A0/AWywvFi3FBxLB5g5s30gekoNuDCUvYNxfgOVJ/s/e17LDOqzore25s86Q67czTMTj
         T5dHSAN5U0mtWKVlOUqp2qqK0iGnT1h/d2zP5DYrAEGFTvYStRaCXLC8iFI8Jxh7zBg2
         Kz+g==
X-Gm-Message-State: AOAM532BXIM9v+UHNoW3CVevL3/fA7MhjZ4aLGlijxDkvwo3Vab8oPyM
        zcyN69DQNKlhp++LV2k+n/9vZZCL
X-Google-Smtp-Source: 
 ABdhPJx3/8S93y8jLmoepNus8JPtOuyIllllO1Ldo3bNfIZ9uqgL4GPPMkrNVfr42P69F2d8P9K8fg==
X-Received: by 2002:adf:fb0f:: with SMTP id c15mr8141049wrr.410.1590742282823;
        Fri, 29 May 2020 01:51:22 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.21
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:21 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 17/34] Import a streaming-capable Murmur3 hash function
 implementation
Date: Fri, 29 May 2020 10:50:21 +0200
Message-Id: <20200529085038.26008-18-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Import the streaming-capable public domain Murmur3 hash function
implementation from https://github.com/aappleby/smhasher/.
It's imported as-is, with 2 space indentation (the horror!) and
whitespace errors...

The reason for importing the streaming-capable implementation is that we
might get some benefit when hashing the paths 'dir', 'dir/subdir',
'dir/subdir/file' one ofter the other.
---
 Makefile          |   1 +
 compat/PMurHash.c | 291 ++++++++++++++++++++++++++++++++++++++++++++++
 compat/PMurHash.h |  62 ++++++++++
 3 files changed, 354 insertions(+)
 create mode 100644 compat/PMurHash.c
 create mode 100644 compat/PMurHash.h

diff --git a/Makefile b/Makefile
index a9db1e8d9b..1338f10796 100644
--- a/Makefile
+++ b/Makefile
@@ -853,6 +853,7 @@ LIB_OBJS += commit.o
 LIB_OBJS += commit-graph.o
 LIB_OBJS += commit-reach.o
 LIB_OBJS += compat/obstack.o
+LIB_OBJS += compat/PMurHash.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
 LIB_OBJS += connect.o
diff --git a/compat/PMurHash.c b/compat/PMurHash.c
new file mode 100644
index 0000000000..df8842aa29
--- /dev/null
+++ b/compat/PMurHash.c
@@ -0,0 +1,291 @@
+/*-----------------------------------------------------------------------------
+ * MurmurHash3 was written by Austin Appleby, and is placed in the public
+ * domain.
+ *
+ * This implementation was written by Shane Day, and is also public domain.
+ *
+ * This is a portable ANSI C implementation of MurmurHash3_x86_32 (Murmur3A)
+ * with support for progressive processing.
+ */
+
+/*-----------------------------------------------------------------------------
+ 
+If you want to understand the MurmurHash algorithm you would be much better
+off reading the original source. Just point your browser at:
+http://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp
+
+
+What this version provides?
+
+1. Progressive data feeding. Useful when the entire payload to be hashed
+does not fit in memory or when the data is streamed through the application.
+Also useful when hashing a number of strings with a common prefix. A partial
+hash of a prefix string can be generated and reused for each suffix string.
+
+2. Portability. Plain old C so that it should compile on any old compiler.
+Both CPU endian and access-alignment neutral, but avoiding inefficient code
+when possible depending on CPU capabilities.
+
+3. Drop in. I personally like nice self contained public domain code, making it
+easy to pilfer without loads of refactoring to work properly in the existing
+application code & makefile structure and mucking around with licence files.
+Just copy PMurHash.h and PMurHash.c and you're ready to go.
+
+
+How does it work?
+
+We can only process entire 32 bit chunks of input, except for the very end
+that may be shorter. So along with the partial hash we need to give back to
+the caller a carry containing up to 3 bytes that we were unable to process.
+This carry also needs to record the number of bytes the carry holds. I use
+the low 2 bits as a count (0..3) and the carry bytes are shifted into the
+high byte in stream order.
+
+To handle endianess I simply use a macro that reads a uint32_t and define
+that macro to be a direct read on little endian machines, a read and swap
+on big endian machines, or a byte-by-byte read if the endianess is unknown.
+
+-----------------------------------------------------------------------------*/
+
+
+#include "PMurHash.h"
+
+/* I used ugly type names in the header to avoid potential conflicts with
+ * application or system typedefs & defines. Since I'm not including any more
+ * headers below here I can rename these so that the code reads like C99 */
+#undef uint32_t
+#define uint32_t MH_UINT32
+#undef uint8_t
+#define uint8_t  MH_UINT8
+
+/* MSVC warnings we choose to ignore */
+#if defined(_MSC_VER)
+  #pragma warning(disable: 4127) /* conditional expression is constant */
+#endif
+
+/*-----------------------------------------------------------------------------
+ * Endianess, misalignment capabilities and util macros
+ *
+ * The following 3 macros are defined in this section. The other macros defined
+ * are only needed to help derive these 3.
+ *
+ * READ_UINT32(x)   Read a little endian unsigned 32-bit int
+ * UNALIGNED_SAFE   Defined if READ_UINT32 works on non-word boundaries
+ * ROTL32(x,r)      Rotate x left by r bits
+ */
+
+/* Convention is to define __BYTE_ORDER == to one of these values */
+#if !defined(__BIG_ENDIAN)
+  #define __BIG_ENDIAN 4321
+#endif
+#if !defined(__LITTLE_ENDIAN)
+  #define __LITTLE_ENDIAN 1234
+#endif
+
+/* I386 */
+#if defined(_M_IX86) || defined(__i386__) || defined(__i386) || defined(i386)
+  #define __BYTE_ORDER __LITTLE_ENDIAN
+  #define UNALIGNED_SAFE
+#endif
+
+/* gcc 'may' define __LITTLE_ENDIAN__ or __BIG_ENDIAN__ to 1 (Note the trailing __),
+ * or even _LITTLE_ENDIAN or _BIG_ENDIAN (Note the single _ prefix) */
+#if !defined(__BYTE_ORDER)
+  #if defined(__LITTLE_ENDIAN__) && __LITTLE_ENDIAN__==1 || defined(_LITTLE_ENDIAN) && _LITTLE_ENDIAN==1
+    #define __BYTE_ORDER __LITTLE_ENDIAN
+  #elif defined(__BIG_ENDIAN__) && __BIG_ENDIAN__==1 || defined(_BIG_ENDIAN) && _BIG_ENDIAN==1
+    #define __BYTE_ORDER __BIG_ENDIAN
+  #endif
+#endif
+
+/* gcc (usually) defines xEL/EB macros for ARM and MIPS endianess */
+#if !defined(__BYTE_ORDER)
+  #if defined(__ARMEL__) || defined(__MIPSEL__)
+    #define __BYTE_ORDER __LITTLE_ENDIAN
+  #endif
+  #if defined(__ARMEB__) || defined(__MIPSEB__)
+    #define __BYTE_ORDER __BIG_ENDIAN
+  #endif
+#endif
+
+/* Now find best way we can to READ_UINT32 */
+#if __BYTE_ORDER==__LITTLE_ENDIAN
+  /* CPU endian matches murmurhash algorithm, so read 32-bit word directly */
+  #define READ_UINT32(ptr)   (*((uint32_t*)(ptr)))
+#elif __BYTE_ORDER==__BIG_ENDIAN
+  /* TODO: Add additional cases below where a compiler provided bswap32 is available */
+  #if defined(__GNUC__) && (__GNUC__>4 || (__GNUC__==4 && __GNUC_MINOR__>=3))
+    #define READ_UINT32(ptr)   (__builtin_bswap32(*((uint32_t*)(ptr))))
+  #else
+    /* Without a known fast bswap32 we're just as well off doing this */
+    #define READ_UINT32(ptr)   (ptr[0]|ptr[1]<<8|ptr[2]<<16|ptr[3]<<24)
+    #define UNALIGNED_SAFE
+  #endif
+#else
+  /* Unknown endianess so last resort is to read individual bytes */
+  #define READ_UINT32(ptr)   (ptr[0]|ptr[1]<<8|ptr[2]<<16|ptr[3]<<24)
+
+  /* Since we're not doing word-reads we can skip the messing about with realignment */
+  #define UNALIGNED_SAFE
+#endif
+
+/* Find best way to ROTL32 */
+#if defined(_MSC_VER)
+  #include <stdlib.h>  /* Microsoft put _rotl declaration in here */
+  #define ROTL32(x,r)  _rotl(x,r)
+#else
+  /* gcc recognises this code and generates a rotate instruction for CPUs with one */
+  #define ROTL32(x,r)  (((uint32_t)x << r) | ((uint32_t)x >> (32 - r)))
+#endif
+
+
+/*-----------------------------------------------------------------------------
+ * Core murmurhash algorithm macros */
+
+#define C1  (0xcc9e2d51)
+#define C2  (0x1b873593)
+
+/* This is the main processing body of the algorithm. It operates
+ * on each full 32-bits of input. */
+#define DOBLOCK(h1, k1) do{ \
+        k1 *= C1; \
+        k1 = ROTL32(k1,15); \
+        k1 *= C2; \
+        \
+        h1 ^= k1; \
+        h1 = ROTL32(h1,13); \
+        h1 = h1*5+0xe6546b64; \
+    }while(0)
+
+
+/* Append unaligned bytes to carry, forcing hash churn if we have 4 bytes */
+/* cnt=bytes to process, h1=name of h1 var, c=carry, n=bytes in c, ptr/len=payload */
+#define DOBYTES(cnt, h1, c, n, ptr, len) do{ \
+    int _i = cnt; \
+    while(_i--) { \
+        c = c>>8 | *ptr++<<24; \
+        n++; len--; \
+        if(n==4) { \
+            DOBLOCK(h1, c); \
+            n = 0; \
+        } \
+    } }while(0)
+
+/*---------------------------------------------------------------------------*/
+
+/* Main hashing function. Initialise carry to 0 and h1 to 0 or an initial seed
+ * if wanted. Both ph1 and pcarry are required arguments. */
+void PMurHash32_Process(uint32_t *ph1, uint32_t *pcarry, const void *key, int len)
+{
+  uint32_t h1 = *ph1;
+  uint32_t c = *pcarry;
+
+  const uint8_t *ptr = (uint8_t*)key;
+  const uint8_t *end;
+
+  /* Extract carry count from low 2 bits of c value */
+  int n = c & 3;
+
+#if defined(UNALIGNED_SAFE)
+  /* This CPU handles unaligned word access */
+
+  /* Consume any carry bytes */
+  int i = (4-n) & 3;
+  if(i && i <= len) {
+    DOBYTES(i, h1, c, n, ptr, len);
+  }
+
+  /* Process 32-bit chunks */
+  end = ptr + len/4*4;
+  for( ; ptr < end ; ptr+=4) {
+    uint32_t k1 = READ_UINT32(ptr);
+    DOBLOCK(h1, k1);
+  }
+
+#else /*UNALIGNED_SAFE*/
+  /* This CPU does not handle unaligned word access */
+
+  /* Consume enough so that the next data byte is word aligned */
+  int i = -(long)ptr & 3;
+  if(i && i <= len) {
+      DOBYTES(i, h1, c, n, ptr, len);
+  }
+
+  /* We're now aligned. Process in aligned blocks. Specialise for each possible carry count */
+  end = ptr + len/4*4;
+  switch(n) { /* how many bytes in c */
+  case 0: /* c=[----]  w=[3210]  b=[3210]=w            c'=[----] */
+    for( ; ptr < end ; ptr+=4) {
+      uint32_t k1 = READ_UINT32(ptr);
+      DOBLOCK(h1, k1);
+    }
+    break;
+  case 1: /* c=[0---]  w=[4321]  b=[3210]=c>>24|w<<8   c'=[4---] */
+    for( ; ptr < end ; ptr+=4) {
+      uint32_t k1 = c>>24;
+      c = READ_UINT32(ptr);
+      k1 |= c<<8;
+      DOBLOCK(h1, k1);
+    }
+    break;
+  case 2: /* c=[10--]  w=[5432]  b=[3210]=c>>16|w<<16  c'=[54--] */
+    for( ; ptr < end ; ptr+=4) {
+      uint32_t k1 = c>>16;
+      c = READ_UINT32(ptr);
+      k1 |= c<<16;
+      DOBLOCK(h1, k1);
+    }
+    break;
+  case 3: /* c=[210-]  w=[6543]  b=[3210]=c>>8|w<<24   c'=[654-] */
+    for( ; ptr < end ; ptr+=4) {
+      uint32_t k1 = c>>8;
+      c = READ_UINT32(ptr);
+      k1 |= c<<24;
+      DOBLOCK(h1, k1);
+    }
+  }
+#endif /*UNALIGNED_SAFE*/
+
+  /* Advance over whole 32-bit chunks, possibly leaving 1..3 bytes */
+  len -= len/4*4;
+
+  /* Append any remaining bytes into carry */
+  DOBYTES(len, h1, c, n, ptr, len);
+
+  /* Copy out new running hash and carry */
+  *ph1 = h1;
+  *pcarry = (c & ~0xff) | n;
+} 
+
+/*---------------------------------------------------------------------------*/
+
+/* Finalize a hash. To match the original Murmur3A the total_length must be provided */
+uint32_t PMurHash32_Result(uint32_t h, uint32_t carry, uint32_t total_length)
+{
+  uint32_t k1;
+  int n = carry & 3;
+  if(n) {
+    k1 = carry >> (4-n)*8;
+    k1 *= C1; k1 = ROTL32(k1,15); k1 *= C2; h ^= k1;
+  }
+  h ^= total_length;
+
+  /* fmix */
+  h ^= h >> 16;
+  h *= 0x85ebca6b;
+  h ^= h >> 13;
+  h *= 0xc2b2ae35;
+  h ^= h >> 16;
+
+  return h;
+}
+
+/*---------------------------------------------------------------------------*/
+
+/* Murmur3A compatable all-at-once */
+uint32_t PMurHash32(uint32_t seed, const void *key, int len)
+{
+  uint32_t h1=seed, carry=0;
+  PMurHash32_Process(&h1, &carry, key, len);
+  return PMurHash32_Result(h1, carry, len);
+}
diff --git a/compat/PMurHash.h b/compat/PMurHash.h
new file mode 100644
index 0000000000..0ee645669e
--- /dev/null
+++ b/compat/PMurHash.h
@@ -0,0 +1,62 @@
+/*-----------------------------------------------------------------------------
+ * MurmurHash3 was written by Austin Appleby, and is placed in the public
+ * domain.
+ *
+ * This implementation was written by Shane Day, and is also public domain.
+ *
+ * This is a portable ANSI C implementation of MurmurHash3_x86_32 (Murmur3A)
+ * with support for progressive processing.
+ */
+
+/* ------------------------------------------------------------------------- */
+/* Determine what native type to use for uint32_t */
+
+/* We can't use the name 'uint32_t' here because it will conflict with
+ * any version provided by the system headers or application. */
+
+/* First look for special cases */
+#if defined(_MSC_VER)
+  #define MH_UINT32 unsigned long
+#endif
+
+/* If the compiler says it's C99 then take its word for it */
+#if !defined(MH_UINT32) && ( \
+     defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L )
+  #include <stdint.h>
+  #define MH_UINT32 uint32_t
+#endif
+
+/* Otherwise try testing against max value macros from limit.h */
+#if !defined(MH_UINT32)
+  #include  <limits.h>
+  #if   (USHRT_MAX == 0xffffffffUL)
+    #define MH_UINT32 unsigned short
+  #elif (UINT_MAX == 0xffffffffUL)
+    #define MH_UINT32 unsigned int
+  #elif (ULONG_MAX == 0xffffffffUL)
+    #define MH_UINT32 unsigned long
+  #endif
+#endif
+
+#if !defined(MH_UINT32)
+  #error Unable to determine type name for unsigned 32-bit int
+#endif
+
+/* I'm yet to work on a platform where 'unsigned char' is not 8 bits */
+#define MH_UINT8  unsigned char
+
+
+/* ------------------------------------------------------------------------- */
+/* Prototypes */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+void PMurHash32_Process(MH_UINT32 *ph1, MH_UINT32 *pcarry, const void *key, int len);
+MH_UINT32 PMurHash32_Result(MH_UINT32 h1, MH_UINT32 carry, MH_UINT32 total_length);
+MH_UINT32 PMurHash32(MH_UINT32 seed, const void *key, int len);
+
+#ifdef __cplusplus
+}
+#endif

From patchwork Fri May 29 08:50:22 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578143
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9C74E912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:31 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 8421B20776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:31 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="TB3mNrCm"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726934AbgE2Iva (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:30 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44408 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726923AbgE2Iv1 (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:27 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 828E0C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:26 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id e1so2572640wrt.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:26 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=Md1ng4Rb3G8KOgVu8ZWyMssm3NvuqmttT7pdRQr6Zxc=;
        b=TB3mNrCmnYFlp+CFKzbbMazmEP6Ouglt/2YxX6NmY/lQrRD1ZjS8mLOaI9mCpLTUn0
         1R4RyoF2DZPVZ3GXywmcYxfsyyi9gNdSzpOrua16+GhWR8eEXZykl8OxI/6c5UcR0at9
         8fExYZwPd0B0YpXPQuR54ZcS6UYWZVx2wd7fjnpEw5FoP1bk4HL2iR4zTXTu/5SOLJ1y
         HTbq48aI+4CoFpqcoePBR3yLktcaxgdv5XykbQ7jRaFILooXMOwq96A1vqXONbjGoFUO
         BUHTqhm180zoQqPz/8SGUM3flV6xKmCbfC2S0RnOjBlwAQeClo/DSkxUHPRlgo8IBjDg
         FeAg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=Md1ng4Rb3G8KOgVu8ZWyMssm3NvuqmttT7pdRQr6Zxc=;
        b=gCsWsFww4i24+PHSQYHGSVTgTFxXK+qs8YC8ydB/28SssEBCaoKL4Uoq3qCBmI3UIU
         5VQs6qvzf+N+/XUTicDYehZzB4mOwvfquDEJ0BxcziXs9wuAENQ0svirVD/PdmMI4CwY
         KNvItcAX/1znshotbYnLWLp6qKjFllNwFhf89OC7cmMhwil38cMK0PhWhgL5FLD4MgDc
         sUx0CzvsOPVLGdjFijKOAmGwBWDPAEU/Ug+dfrtOu5wv1N5Q7HB4syMFUJxV8G2M6bc2
         c/fyV9YZeDV7I/m8H5pSN1gXWbZI/QnPR7AoHEiqHeqeAeXp3hNToSQ6Q1rqzJh873D2
         rBHA==
X-Gm-Message-State: AOAM533HTtxzIc8dmaAZc3EAYbp5tqYtvOUc29rXWrjeHUB5ibK3KkJp
        K+CMlSCA/6KrDPofX57wRg1kUn15
X-Google-Smtp-Source: 
 ABdhPJzGhSIRJ6yMH9Pwjx7cfjfOM/HQkQoLabogi1DOz53Eq5a+7faV7qxM4UN50ojzpaiI4wSC7Q==
X-Received: by 2002:a5d:6391:: with SMTP id p17mr8222001wru.118.1590742284882;
        Fri, 29 May 2020 01:51:24 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.23
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:23 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter
 Index chunk
Date: Fri, 29 May 2020 10:50:22 +0200
Message-Id: <20200529085038.26008-19-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Write the Modifed Path Bloom Filter Index chunk if the
'core.modifiedPathBloomFilters' configuration variable is set.  For
now store only "no Bloom filter for this commit" entries.

The format of the various modified path Bloom filter-related chunks
are not (yet) compatible with split commit-graph files, so ignore the
config setting and don't write modifed path Bloom filters with
'--split'.
---
 Documentation/config/core.txt |  6 ++++++
 commit-graph.c                | 40 +++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 9e440b160d..820f7d3bcf 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -588,6 +588,12 @@ core.commitGraph::
 	to parse the graph structure of commits. Defaults to true. See
 	linkgit:git-commit-graph[1] for more information.
 
+core.modifiedPathBloomFilters::
+	EXPERIMENTAL!!
+	If true, then Git will store and use Bloom filters in the
+	commit-graph file to speed up pathspec-limited revision walks.
+	Defaults to false.
+
 core.useReplaceRefs::
 	If set to `false`, behave as if the `--no-replace-objects`
 	option was given on the command line. See linkgit:git[1] and
diff --git a/commit-graph.c b/commit-graph.c
index e3adffd58b..28f147a418 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -21,6 +21,7 @@
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
+#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -33,6 +34,10 @@
 
 #define GRAPH_LAST_EDGE 0x80000000
 
+#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE 0xffffffffffffffff
+
+#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES 7
+
 #define GRAPH_HEADER_SIZE 8
 #define GRAPH_FANOUT_SIZE (4 * 256)
 #define GRAPH_CHUNKLOOKUP_WIDTH 12
@@ -791,6 +796,11 @@ struct write_commit_graph_context {
 		 check_oids:1;
 
 	const struct split_commit_graph_opts *split_opts;
+
+	struct modified_path_bloom_filter_context {
+		unsigned use_modified_path_bloom_filters:1;
+		unsigned int num_hashes;
+	} mpbfctx;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -990,6 +1000,20 @@ static int write_graph_chunk_extra_edges(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
+		struct write_commit_graph_context *ctx)
+{
+	const uint64_t no_bloom_filter = GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE;
+	int i;
+
+	hashwrite(f, &ctx->mpbfctx.num_hashes, sizeof(uint8_t));
+	for (i = 0; i < ctx->commits.nr; i++) {
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+		hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
+	}
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1428,6 +1452,14 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunks[chunks_nr].write_fn = write_graph_chunk_extra_edges;
 		chunks_nr++;
 	}
+	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
+		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+		chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX;
+		chunks[chunks_nr].size = sizeof(uint8_t) +
+					 ctx->commits.nr * sizeof(uint64_t);
+		chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_index;
+		chunks_nr++;
+	}
 	if (ctx->num_commit_graphs_after > 1) {
 		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 		chunks[chunks_nr].id = GRAPH_CHUNKID_BASE;
@@ -1824,6 +1856,14 @@ int write_commit_graph(const char *obj_dir,
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
 
+	git_config_get_bool("core.modifiedPathBloomFilters", &res);
+	if (res && ctx->split) {
+		warning("not writing modified path Bloom filters with --split");
+	} else if (res) {
+		ctx->mpbfctx.use_modified_path_bloom_filters = 1;
+		ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES;
+	}
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);

From patchwork Fri May 29 08:50:23 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578145
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1C981739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:34 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 051F1207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:34 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="Fo5NESDN"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726953AbgE2Ivd (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:33 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44412 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726911AbgE2Iv2 (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:28 -0400
Received: from mail-wm1-x342.google.com (mail-wm1-x342.google.com
 [IPv6:2a00:1450:4864:20::342])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 10420C08C5C8
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:28 -0700 (PDT)
Received: by mail-wm1-x342.google.com with SMTP id l26so2319922wme.3
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:27 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=DkYxr9eR1tUXrGS2oYdrcl4hMsAHKOufUaJa6EKXO5I=;
        b=Fo5NESDNXHWc+n2oMn6WaNNDaIS4VmZGOz5zEb2WSR47ezxMlQZk7br1A1x2tDNShC
         fFShTja5SWLpuoSjUb0+ZIOyZqLKtzf9mUI+WC85MGNKa1N/+EtC+SzrUz7WsK1qf1Dt
         9VIXuJMeQGoFnQguoelhsSNCXIWp/P/bN29cb2HzBkTtTLSHUCWB1RL4n7a3Rsyf7vSM
         W7lyFBIH3zXFAtpCVD3mv9y8qUNlzRzEZOiHX97CTc/PPX6QPKKYwPtXsc4/QXJsuiae
         hlq0wy1v0dId7tiwnCqI9Qaqo/Q8p7gnUDyIP47fKlbBzcW2WPL4kamZ8O5J7onJHTS0
         a3OQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=DkYxr9eR1tUXrGS2oYdrcl4hMsAHKOufUaJa6EKXO5I=;
        b=uacrlTFeCBbmQ9WplgfFU4ANteL7xpY+jwO7keoIEGgBP0Ovk3NBIt5z37FiMz5ovp
         nEtxe6X39r7KNFEPPZximd5NSDEZ1q6ur6i19OU5LC2UE9KRAV4rj/ODnim9GTSEczqu
         uQ4Vm/Vu5rp0coCMQx6Z0fcQLdq4nYRRxFbOf7Nd7rex81wVvX8EjB+2Y7UFFjZjKHs4
         jvMPHToYgBMAjQGTbDJsoPZwaS/qpiL7WQi685UyDOF2tuxSzVNLlC6HRmfkuotw4noo
         xiYcr+qjpgEitQei/YreA8Tr42lbnLTqh0vt77KEzWVgHYom7hGbFJJDPvwQ1NakygdO
         Q4iw==
X-Gm-Message-State: AOAM531Lwt8QFtJ/Iklg2nOiEbjojMAn2WnT3AizR2HqdLNyLhiVmVd4
        tLbJUMdHXokAVwYld0Q9PuepmGF9
X-Google-Smtp-Source: 
 ABdhPJwB5tFmMDqV2IXuCpw99VuWqTuyIxdhOc7nW3xuMo7/+PG9ZDBeQAkGvMQsIgx9QrflLeAkXw==
X-Received: by 2002:a1c:7305:: with SMTP id d5mr7758964wmb.85.1590742286491;
        Fri, 29 May 2020 01:51:26 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.24
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:25 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 19/34] commit-graph: add commit slab for modified path Bloom
 filters
Date: Fri, 29 May 2020 10:50:23 +0200
Message-Id: <20200529085038.26008-20-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

---
 commit-graph.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 28f147a418..3a38f25ce9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -14,6 +14,8 @@
 #include "hashmap.h"
 #include "replace-object.h"
 #include "progress.h"
+#include "bloom-filter.h"
+#include "commit-slab.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -47,6 +49,20 @@
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+struct modified_path_bloom_filter_info {
+	struct bloom_filter filter;
+};
+
+static void free_modified_path_bloom_filter_info_in_slab(
+		struct modified_path_bloom_filter_info *bfi)
+{
+	bloom_filter_free(&bfi->filter);
+}
+
+define_commit_slab(modified_path_bloom_filters,
+		   struct modified_path_bloom_filter_info);
+static struct modified_path_bloom_filters modified_path_bloom_filters;
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
 	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
@@ -1008,8 +1024,18 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 
 	hashwrite(f, &ctx->mpbfctx.num_hashes, sizeof(uint8_t));
 	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *commit = ctx->commits.list[i];
+		struct modified_path_bloom_filter_info *bfi;
+
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
+
+		bfi = modified_path_bloom_filters_peek(
+				&modified_path_bloom_filters, commit);
+
+		if (!bfi || !bfi->filter.nr_bits)
+			hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
+		else
+			BUG("writing proper Bloom filters is not implemented yet");
 	}
 	return 0;
 }
@@ -1864,6 +1890,9 @@ int write_commit_graph(const char *obj_dir,
 		ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES;
 	}
 
+		init_modified_path_bloom_filters(&modified_path_bloom_filters);
+	}
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
@@ -1984,6 +2013,11 @@ int write_commit_graph(const char *obj_dir,
 		free(ctx->commit_graph_hash_after);
 	}
 
+	if (ctx->mpbfctx.use_modified_path_bloom_filters)
+		deep_clear_modified_path_bloom_filters(
+				&modified_path_bloom_filters,
+				free_modified_path_bloom_filter_info_in_slab);
+
 	free(ctx);
 
 	return res;

From patchwork Fri May 29 08:50:24 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578147
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3449E912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:35 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 17C74207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:35 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="BaY3DwvH"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726943AbgE2Ivc (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:32 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44408 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726936AbgE2Iva (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:30 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C2520C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:29 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id x6so2471964wrm.13
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=GpLDdL2b7e0wU0xNiePpZD93QK43SHlYK9zMg82PJXk=;
        b=BaY3DwvH1S00lz4xfuSa03DmPMyvKR0ERL3mChdfQjTVBm+1CKY2zzWbh3k1TRRKYe
         +lC8XdTWxuAh6l97HVJwAQJjt9kJDdeeJejLGSmFV6IK0XocGYIP42WLeq1mYhuTcZ4n
         ILxErlbrp52z8r29fErKsPrdfqr4k9OyFXOUvUnqq9FwGcrTVmzIv3ql9TzsRxvZkEgY
         yvv646g8JDnFVjqcz9JvNZPSS4/o1jJQ+0BSsc+pP7Hv9slfw1GuqEbuhPUGhCQOPslR
         mOFQCkc2OLdP6l53Hss3zQ6gbwx0q8DqDHt5pXbRCMaIeGZF7rBCH4EW5s3mUJjb3wGJ
         wJTQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=GpLDdL2b7e0wU0xNiePpZD93QK43SHlYK9zMg82PJXk=;
        b=CInM12YxhYgQcoVS2l6HLxxo+KRfbUdDnmtOKKMxiO5CUM2mv4WEjCzzPBL099ZyXh
         NVzEjncmWhWSGImKd2IFRbeyUDR+cpoSNgJAEdcEJ6HoxOF2JB+xc6ZjS/lpFKspp3z/
         cysIgA8ghc4ooQAql5jocSlCOje9+UIOmibmEDUaekNt6ySG33Oe8KUzDduGRilcWBse
         8b81Olwp0jm3L6oES0xCuGMew86oYm4DagTatFITNtIjcOT9T0jYSLVgSeJp/1IZxPhR
         qoXOASJtVNv64KPW/4txuPS8A/XEe2YKz5RJfKzz6oRwIku1j1BuAuhIYH6pgtRVUr0z
         //OQ==
X-Gm-Message-State: AOAM533RKK3g0cy01phNbQJYHsmaQEp7HK/90uA+WGWbITKm1ohAso7x
        9k6ACDDWcR/SvLCSggKIzz7qMYAg
X-Google-Smtp-Source: 
 ABdhPJyvqgYnpHa3JxKuHFpbXEBYz1hjvoX5O0yNs8bfMLJYlqyBAb8IGUD6c9WGFq5zvgWwS5tn2Q==
X-Received: by 2002:a05:6000:12c4:: with SMTP id
 l4mr8188626wrx.128.1590742287823;
        Fri, 29 May 2020 01:51:27 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.26
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:27 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index
 chunk
Date: Fri, 29 May 2020 10:50:24 +0200
Message-Id: <20200529085038.26008-21-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Collect all paths modified by all commits, and write embedded modified
path Bloom filters for those commits that modify few enough paths to
fit into a 63 bit array using 7 hashes per path, i.e. 6 or less.

For now all modified path Bloom filters are created from scratch,
re-using filters from the existing commit-graph file will be done in a
later patch in this series.

Plug this into the custom revision walking loop in close_reachable(),
so the iteration is at least somewhat friendly to the delta base
cache, and we won't have yet another loop iterating over all commits.
Take care care that this loop might encounter the same commits
multiple times (e.g. because more than one refs point to the same
commit, or a commit is present in more than one pack file), and skip
creating a modified path Bloom filter for already processed commits.
We wouldn't need to be careful about this if we were to create
modified path Bloom filters a bit later on, after the list of commits
have been deduplicated, but at that point the commits are in SHA-1
order, i.e. effectively random, which is downright hostile to the
caches.

Perhaps processing the modified paths returned by the tree-diff
machinery turned out to be more complicated than strictly necessary,
but...

So, my initial approach was to first let diff_tree_oid() and
diffcore_std() do their thing and put all modified files into a diff
queue, then iterate over that queue and add all paths (i.e. leading
directories as well) to a hashmap with a suitable comparison function
for deduplication, use the number of elements in that hashmap to
properly size the Bloom filter's bit array, and finally iterate over
all paths in the hashmap and add them to the Bloom filter.  Simple,
short, straightforward, almost idiomatic even, without subtle corner
cases.

This approach, however, involved lot of memory allocations: three
allocations for each entry in the diff queue (IOW for each modified
file), and one allocation for each hashmap entry (IOW for each
modified path), not counting allocations for the hashmap's internals.
All these memory allocations caused considerable overhead.

This patch implements a more efficient, but more complex, approach:
register callback functions in 'struct diff_options' for add/removal
and modification, which process, deduplicate, and hash modified paths
_while_ the tree-diff machinery scans and compares the two trees.
Tree-diff calls these callback functions with the filenames in order
[1], so the deduplication of modified paths basically boils down to
skipping the common prefix with the previously processed path (and
some subtlety in case of dir-file replacement, as explained in an
in-code comment).  This way we don't do any memory allocations for
each modified file or path while processing the diff output, and the
memory allocations that are still left are amortized over the whole
commit-graph writing process.

[1] The tree-diff machinery would call its callbacks with out of order
    filenames if it encountered a bogus tree object with unsorted
    entries.  This doesn't affect the correctness of modified
    path Bloom filters only their sizes: the worst thing that can
    happen is that this approach would count and hash some paths more
    than once, thus Bloom filters containing paths modified compared
    to such a bogus tree object would end up bigger than expected.
    And bogus unsorted tree object cause all kinds of weird diff
    issues anyway...

[TODO: The pathchange callback is no good.]

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 243 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 238 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3a38f25ce9..413605a29c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,8 @@
 #include "progress.h"
 #include "bloom-filter.h"
 #include "commit-slab.h"
+#include "diff.h"
+#include "compat/PMurHash.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -37,8 +39,10 @@
 #define GRAPH_LAST_EDGE 0x80000000
 
 #define GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE 0xffffffffffffffff
+#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS 63
 
 #define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES 7
+#define GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_EMBEDDED_LIMIT 6
 
 #define GRAPH_HEADER_SIZE 8
 #define GRAPH_FANOUT_SIZE (4 * 256)
@@ -773,6 +777,36 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+static uint32_t modified_path_bloom_seeds[] = {
+	0xe83c5163U,
+	0x3b376b0cU,
+	0x72441af7U,
+	0x433860f3U,
+	0x6d9617f4U,
+	0x22705334U,
+	0xc8af66abU,
+	0x43b44ccfU,
+	0xb84f767cU,
+	0x1da177e4U
+};
+
+static void compute_modified_path_bloom_hashes_for_path(const char *path,
+		size_t len, unsigned int num_hashes, uint32_t *hashes)
+{
+	uint32_t h1, h2, i;
+
+	h1 = PMurHash32(modified_path_bloom_seeds[0], path, len);
+	h2 = PMurHash32(modified_path_bloom_seeds[1], path, len);
+
+	/* Equivalent to hashes[i] = h1 + i * h2 + (i * i * i - i) / 6 */
+	hashes[0] = h1;
+	for (i = 1; i < num_hashes; i++) {
+		h1 += h2;
+		h2 += i;
+		hashes[i] = h1;
+	}
+}
+
 struct packed_commit_list {
 	struct commit **list;
 	int nr;
@@ -816,6 +850,29 @@ struct write_commit_graph_context {
 	struct modified_path_bloom_filter_context {
 		unsigned use_modified_path_bloom_filters:1;
 		unsigned int num_hashes;
+		/*
+		 * Number of paths to be added to "embedded" modified path
+		 * Bloom filters.
+		 */
+		int embedded_limit;
+
+		struct diff_options diffopt;
+
+		/*
+		 * Used to deduplicate paths modified by the currenly
+		 * processed commit.
+		 */
+		struct strbuf prev_path;
+		size_t *hashed_prefix_lengths;
+		int hashed_prefix_lengths_nr, hashed_prefix_lengths_alloc;
+
+		/*
+		 * The tree-diff callbacks will fill this array with
+		 * 'num_hashes' hash values for each path modified by the
+		 * currently processed commit.
+		 */
+		uint32_t *hashes;
+		int hashes_nr, hashes_alloc;
 	} mpbfctx;
 };
 
@@ -1032,10 +1089,20 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
 
-		if (!bfi || !bfi->filter.nr_bits)
+		if (!bfi || !bfi->filter.nr_bits) {
 			hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
-		else
-			BUG("writing proper Bloom filters is not implemented yet");
+		} else if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
+			uint8_t filterdata[sizeof(uint64_t)];
+			memcpy(filterdata, bfi->filter.bits,
+			       sizeof(filterdata));
+			/*
+			 * Set the most significant bit of the first byte to
+			 * indicate embedded modified path Bloom Filter.
+			 */
+			filterdata[0] |= 1 << 7;
+			hashwrite(f, filterdata, sizeof(filterdata));
+		} else
+			BUG("writing non-embedded Bloom filters is not implemented yet");
 	}
 	return 0;
 }
@@ -1074,9 +1141,162 @@ static int add_packed_commits(const struct object_id *oid,
 	return 0;
 }
 
+static void handle_modified_file(
+		struct modified_path_bloom_filter_context *mpbfctx,
+		const char *path)
+{
+	const char *prev = mpbfctx->prev_path.buf;
+	const char *p = path;
+	size_t common_prefix_len, already_hashed_prefix_len = 0;
+	int i;
+
+	/*
+	 * We want to hash each modified path only once, i.e. if a commit
+	 * modifies both 'dir/foo' and 'dir/bar', then we want to hash
+	 * 'dir' only once.
+	 */
+
+	/*
+	 * Skip common prefix with the previously handled path, so we
+	 * hash any common leading directories only once.
+	 */
+	while (*prev && *p && *prev == *p) {
+		prev++;
+		p++;
+	}
+	common_prefix_len = p - path;
+
+	/*
+	 * However, merely skipping the common prefix is not enough.
+	 *
+	 * If a commit removes a file (symlink, or submodule) and adds a
+	 * directory with the same name (or vice versa), e.g. removes
+	 * 'foo' and adds 'foo/bar', and adds/modifies/removes 'foo.c',
+	 * then the tree-diff machinery will call this function with
+	 * those paths in the following order:
+	 *
+	 *   foo
+	 *   foo.c
+	 *   foo/bar
+	 *
+	 * So the first call hashes 'foo', the second hashes 'foo.c', so
+	 * far so good.  In the third call the common prefix of 'foo.c'
+	 * and 'foo/bar' is 'foo', and then an immediate strchrnul(p, '/')
+	 * would find the subsequent '/' and would hash 'foo' _again_.
+	 * Not so good.
+	 *
+	 * To avoid this we maintain the length of already hashed common
+	 * prefixes and skip one more character (i.e. the '/') if
+	 * necessary.
+	 */
+	for (i = 0; i < mpbfctx->hashed_prefix_lengths_nr; i++) {
+		if (common_prefix_len < mpbfctx->hashed_prefix_lengths[i]) {
+			mpbfctx->hashed_prefix_lengths_nr = i;
+			break;
+		}
+
+		already_hashed_prefix_len = mpbfctx->hashed_prefix_lengths[i];
+	}
+	if (common_prefix_len == already_hashed_prefix_len)
+		p = path + already_hashed_prefix_len + 1;
+
+	do {
+		unsigned int new_hashes_nr = mpbfctx->hashes_nr + mpbfctx->num_hashes;
+
+		p = strchrnul(p, '/');
+
+		ALLOC_GROW(mpbfctx->hashes, new_hashes_nr,
+			   mpbfctx->hashes_alloc);
+		compute_modified_path_bloom_hashes_for_path(path, p - path,
+				mpbfctx->num_hashes,
+				mpbfctx->hashes + mpbfctx->hashes_nr);
+		mpbfctx->hashes_nr = new_hashes_nr;
+
+		ALLOC_GROW(mpbfctx->hashed_prefix_lengths,
+			   mpbfctx->hashed_prefix_lengths_nr + 1,
+			   mpbfctx->hashed_prefix_lengths_alloc);
+		mpbfctx->hashed_prefix_lengths[mpbfctx->hashed_prefix_lengths_nr] = p - path;
+		mpbfctx->hashed_prefix_lengths_nr++;
+	} while (*p && p++);
+
+	/*
+	 * Update the previous path for the next iteration, copying only
+	 * as many characters as necessary.
+	 */
+	strbuf_setlen(&mpbfctx->prev_path, common_prefix_len);
+	strbuf_add(&mpbfctx->prev_path, path + common_prefix_len,
+		   p - (path + common_prefix_len));
+}
+
+static void file_add_remove_cb(struct diff_options *options,
+			       int addremove, unsigned mode,
+			       const struct object_id *oid,
+			       int oid_valid, const char *fullpath,
+			       unsigned dirty_submodule)
+{
+	handle_modified_file(options->change_fn_data, fullpath);
+}
+
+static void file_change_cb(struct diff_options *options,
+			   unsigned old_mode, unsigned new_mode,
+			   const struct object_id *old_oid,
+			   const struct object_id *new_oid,
+			   int old_oid_valid, int new_oid_valid,
+			   const char *fullpath,
+			   unsigned old_dirty_submodule,
+			   unsigned new_dirty_submodule)
+{
+	handle_modified_file(options->change_fn_data, fullpath);
+}
+
+static void create_modified_path_bloom_filter(
+		struct modified_path_bloom_filter_context *mpbfctx,
+		struct commit *commit)
+{
+	struct modified_path_bloom_filter_info *bfi;
+	struct object_id *parent_oid;
+	uintmax_t path_component_count;
+
+	if (!mpbfctx->use_modified_path_bloom_filters)
+		return;
+
+	/*
+	 * This function can be called multiple times for the same commit
+	 * (a commit can be present in multiple pack files, or multiple
+	 * refs can point to the same commit) so check first whether we have
+	 * already created a modified path Bloom filter for it.
+	 */
+	bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters,
+					       commit);
+	if (bfi && bfi->filter.nr_bits)
+		return;
+
+	parent_oid = commit->parents ? &commit->parents->item->object.oid :
+				       NULL;
+	mpbfctx->hashes_nr = 0;
+	strbuf_reset(&mpbfctx->prev_path);
+	diff_tree_oid(parent_oid, &commit->object.oid, "", &mpbfctx->diffopt);
+	path_component_count = mpbfctx->hashes_nr / mpbfctx->num_hashes;
+
+	if (path_component_count > mpbfctx->embedded_limit) {
+		/* Not implemented yet. */
+	} else {
+		bfi = modified_path_bloom_filters_at(
+				&modified_path_bloom_filters, commit);
+
+		bloom_filter_init_with_size(&bfi->filter,
+				GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS);
+		bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
+				      mpbfctx->hashes_nr);
+	}
+}
+
 static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
 {
 	struct commit_list *parent;
+
+	create_modified_path_bloom_filter(&ctx->mpbfctx, commit);
+
 	for (parent = commit->parents; parent; parent = parent->next) {
 		if (!(parent->item->object.flags & REACHABLE)) {
 			ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
@@ -1888,9 +2108,18 @@ int write_commit_graph(const char *obj_dir,
 	} else if (res) {
 		ctx->mpbfctx.use_modified_path_bloom_filters = 1;
 		ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES;
-	}
+		ctx->mpbfctx.embedded_limit = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS / (ctx->mpbfctx.num_hashes * 10 / 7);
 
 		init_modified_path_bloom_filters(&modified_path_bloom_filters);
+
+		repo_diff_setup(ctx->r, &ctx->mpbfctx.diffopt);
+		ctx->mpbfctx.diffopt.flags.recursive = 1;
+		ctx->mpbfctx.diffopt.add_remove = file_add_remove_cb;
+		ctx->mpbfctx.diffopt.change = file_change_cb;
+		ctx->mpbfctx.diffopt.change_fn_data = &ctx->mpbfctx;
+		diff_setup_done(&ctx->mpbfctx.diffopt);
+
+		strbuf_init(&ctx->mpbfctx.prev_path, PATH_MAX);
 	}
 
 	if (ctx->split) {
@@ -2013,10 +2242,14 @@ int write_commit_graph(const char *obj_dir,
 		free(ctx->commit_graph_hash_after);
 	}
 
-	if (ctx->mpbfctx.use_modified_path_bloom_filters)
+	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
 		deep_clear_modified_path_bloom_filters(
 				&modified_path_bloom_filters,
 				free_modified_path_bloom_filter_info_in_slab);
+		strbuf_release(&ctx->mpbfctx.prev_path);
+		free(ctx->mpbfctx.hashed_prefix_lengths);
+		free(ctx->mpbfctx.hashes);
+	}
 
 	free(ctx);
 

From patchwork Fri May 29 08:50:25 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578149
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8AE0E912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:44 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 6662120776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:44 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="HjlPcBWx"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726965AbgE2Ivl (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:41 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44424 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726393AbgE2Ivc (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:32 -0400
Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com
 [IPv6:2a00:1450:4864:20::441])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B6173C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:31 -0700 (PDT)
Received: by mail-wr1-x441.google.com with SMTP id e1so2572888wrt.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=Bmrij8QsJAHFw41Cp840lVkMmesXohpmVoiWgCWgeao=;
        b=HjlPcBWxmW+veH4Icl1Fspe4bwuHQJZUz27zUXA4YLjDgsT+kdWxViVBaWwb66hBB/
         YGC7mUxlM0LNUjXx0vW4167jTdAFYTfZHe83d+ChnmKZ3ooMpaQLb8wJ3Icl8665CB6u
         btzIZSE0elVRpi16qAIlNF9AgWdVJLcF4ymRD1R1GOO6rkchAZJI0LlKL/ag3QLygDHn
         7MIi6xC68jLw1/RgpsABMCLni0W/2Q5BOvDblaxx1IHRItmzOxxdMz7JUHw3CAohdKa6
         JaWqjBpxjWGyaBp2UsKU+i1hTLngOM+og6okt4V5QkhJQQfHA8Qmu8ULWyJXqqtIK3MV
         hLJQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=Bmrij8QsJAHFw41Cp840lVkMmesXohpmVoiWgCWgeao=;
        b=s11JJcvyBVSehLTQnP6TZSXBc/8aB7+/AH/cNrtQXVJHQkmvPxaAxI1ZO2TYMkP1ue
         1iJRN7Tq50w0NeHY56wUFnrCF3r+LXwazN4zXozKDeTTX7kyPpMruR/xDgSfOsFrJaHD
         hX6Eyu7Ti2+STMMNPmq7aWRhtEHvOGH7sEA25oWGpVa3dCywPLW4BwEHOCc/llsQ8iIL
         osZoZGEtBr7giLdkDVAB3V+BZi2u9o5hiaTcJSOPNTLLtoImQX3rsq9dFPsFlJtr6bZQ
         sw+DXJ/ICwMcdorMUE91WuNoHewmaFgoj/qzm0O+QuD8p1quwKJ2DZak8kMooiqr2MxE
         utVw==
X-Gm-Message-State: AOAM532gtvYTfaSvMxU9q8ID31NzT/j5fDkr24kmnlZGbyvA719Hhtoj
        rhd7S5IXM+uDMPl7qzFpzhLLcsVH
X-Google-Smtp-Source: 
 ABdhPJyRQBkAx38fSxkd65wViqRe1yS52yutdik9Nm6SMg18qxeIIOF1DO8iOI/Vx8EfNIvIudQH8g==
X-Received: by 2002:a5d:6550:: with SMTP id z16mr8211619wrv.392.1590742289715;
        Fri, 29 May 2020 01:51:29 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.28
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:28 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 21/34] commit-graph: load and use the Modified Path Bloom
 Filter Index chunk
Date: Fri, 29 May 2020 10:50:25 +0200
Message-Id: <20200529085038.26008-22-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Load the Modified Path Bloom Filter Index (MPBI) chunk and check the
embedded modified path Bloom filters within to speed up
pathspec-limited revison walks.

Extend 'struct pathspec_item' to hold the hashes of each path for
modified path Bloom filters.  Pathspecs are used by many Git commands,
and a lot of them doesn't do any revision walk at all, so don't
initialize those fields in parse_pathspec(), but only when we are
about to start a pathspec-limited revision walk, i.e. in
prepare_revision_walk().  Initialize these fields only in the
revs->pruning.pathspec instance, because that's the one used to limit
the revision walk.  Note that some revision walks re-parse the
pathspecs, notably 'git log --follow' and the line-level log, but they
don't use this pathspec instance.

The table below shows the average runtime and speedup of

  git rev-list HEAD -- "$path"

for 5000+ randomly selected paths from each repository with and
without modified path Bloom filters:

                   Average runtime
                  without      with
                    MPBI       MPBI
  --------------------------------------------
  android-base    0.8780s      n/a
  cmssw           0.3143s    0.2104s    -33.1%
  cpython         0.7453s    0.3410s    -54.2%
  gcc             7.1852s    3.3633s    -53.2%
  gecko-dev       4.6113s      n/a
  git             0.6180s    0.2184s    -64.7%
  glibc           0.5618s    0.3245s    -42.2%
  jdk             0.0482s    0.0395s    -18.0%
  linux           0.7043s    0.4810s    -31.6%
  llvm-project    2.6844s      n/a
  rails           0.2784s    0.1792s    -35.6%
  rust            0.7757s    0.6073s    -21.7%
  webkit          1.9137s    1.4832s    -22.5%

The results in the homebrew-core repository can be spectacular, though
that repository looks like as if it was specifically designed to show
off the benefits of this patch: over 98% of its almost 164k commits in
its linear history modify one file within the 'Formula' directory, and
that directory contains almost 5000 files.  Consequently, almost all
commits will have a 63 bit "embedded" modified path Bloom filter
containing only 2 paths, resulting in a false positive rate less than
0.1%, and they can spare a lot of expensive tree entry scanning to the
end of that directory.  So looking at the history of the first and
last files in that directory:

                         Runtime                       Max RSS
                   without     with              without      with
                     MPBI      MPBI   Speedup      MPBI       MPBI
  -----------------------------------------------------------------
  Formula/a2ps.rb   8.390s    0.218s    38.5x    422884k    129212k
  Formula/zzz.rb   80.495s    0.176s   450.4x    423744k     68364k

('Formula/a2ps.rb' is modified 32 times and has two false positives,
while 'zzz.rb' is modified only twice and happens to have only a
single false positive; that's why querying the history of the former
is slower, even though it's at the beginning of the directory).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++
 commit-graph.h |  13 +++++
 pathspec.c     |   9 +++
 pathspec.h     |  12 ++++
 revision.c     |  18 ++++++
 5 files changed, 198 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 413605a29c..fb24600bb3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -26,6 +26,7 @@
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */
+#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES 0x4d504258 /* "MPBX" */
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -308,6 +309,23 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 				chunk_repeated = 1;
 			else
 				graph->chunk_base_graphs = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX:
+			if (graph->chunk_mpbf_index)
+				chunk_repeated = 1;
+			else {
+				graph->chunk_mpbf_index = data + chunk_offset;
+				graph->chunk_mpbf_index_size = next_chunk_offset - chunk_offset;
+			}
+			break;
+
+		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES:
+			if (graph->chunk_mpbf_excludes)
+				chunk_repeated = 1;
+			else
+				graph->chunk_mpbf_excludes = data + chunk_offset;
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -324,6 +342,29 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		return NULL;
 	}
 
+	if (graph->chunk_mpbf_index) {
+		int res = 0;
+		git_config_get_bool("core.modifiedPathBloomFilters", &res);
+		if (res) {
+			uint64_t expected_size = sizeof(uint8_t) +
+						 graph->num_commits * sizeof(uint64_t);
+			if (graph->chunk_mpbf_index_size != expected_size)
+				BUG(_("for %u commits the Modified Path Bloom Filter Index chunk should be %ld bytes, but it's %ld"),
+				    graph->num_commits, expected_size,
+				    graph->chunk_mpbf_index_size);
+			graph->num_modified_path_bloom_hashes = *graph->chunk_mpbf_index;
+			if (!graph->num_modified_path_bloom_hashes)
+				BUG(_("Modified Path Bloom Filter Index chunk using 0 hashes"));
+			if (graph->chunk_mpbf_excludes)
+				/*
+				 * Modified Path Bloom Filter Excludes are
+				 * not supported yet.
+				 */
+				graph->chunk_mpbf_index = NULL;
+		} else
+			graph->chunk_mpbf_index = NULL;
+	}
+
 	return graph;
 }
 
@@ -777,6 +818,68 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+static int load_modified_path_bloom_filter_from_graph(
+		struct commit_graph *graph, struct commit *commit,
+		struct commit *parent, struct bloom_filter *bf)
+{
+	const uint8_t *bloom_index;
+	int first_parent = 0;
+
+	if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+		return 0;
+	if (!graph)
+		return 0;
+	if (!graph->chunk_mpbf_index)
+		return 0;
+
+	if (!commit->parents || commit->parents->item == parent)
+		first_parent = 1;
+
+	bloom_index = graph->chunk_mpbf_index + sizeof(uint8_t) +
+		      sizeof(uint64_t) * commit->graph_pos;
+
+	if (bloom_index[0] & (1 << 7)) {
+		uint64_t v;
+		if (!first_parent)
+			return 0;
+		memcpy(&v, bloom_index, sizeof(v));
+		if (v == GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE)
+			return 0;
+
+		/* embedded modified path Bloom filter */
+		bf->nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS;
+		bf->bits = (uint8_t*) bloom_index;
+		return 1;
+	}
+	/* support for non-embedded Bloom filters is not implemented yet. */
+	return 0;
+}
+
+enum bloom_result check_modified_path_bloom_filter(struct repository *r,
+		struct pathspec *pathspec, struct commit *parent,
+		struct commit *commit)
+{
+	struct bloom_filter bf;
+	int i;
+
+	if (!pathspec->can_use_modified_path_bloom_filters)
+		return BLOOM_CANT_TELL;
+	if (!load_modified_path_bloom_filter_from_graph(r->objects->commit_graph,
+							commit, parent, &bf))
+		return BLOOM_CANT_TELL;
+
+	for (i = 0; i < pathspec->nr; i++) {
+		struct pathspec_item *pi = &pathspec->items[i];
+
+		if (bloom_filter_check_bits(&bf,
+					pi->modified_path_bloom_hashes,
+					pi->modified_path_bloom_hashes_nr))
+			return BLOOM_POSSIBLY_YES;
+	}
+
+	return BLOOM_DEFINITELY_NOT;
+}
+
 static uint32_t modified_path_bloom_seeds[] = {
 	0xe83c5163U,
 	0x3b376b0cU,
@@ -807,6 +910,49 @@ static void compute_modified_path_bloom_hashes_for_path(const char *path,
 	}
 }
 
+void init_pathspec_bloom_fields(struct repository *r,
+				struct pathspec *pathspec)
+{
+	const unsigned bloom_compatible_magic = PATHSPEC_LITERAL;
+	struct commit_graph *graph = r->objects->commit_graph;
+	int i;
+
+	if (!graph)
+		return;
+	if (!graph->chunk_mpbf_index)
+		return;
+	if (!pathspec->nr)
+		return;
+	if (pathspec->has_wildcard)
+		return;
+	if (pathspec->magic & ~bloom_compatible_magic)
+		return;
+
+	for (i = 0; i < pathspec->nr; i++) {
+		struct pathspec_item *pi = &pathspec->items[i];
+		const char *path = pi->match;
+		size_t len = pi->len;
+
+		/*
+		 * Pathspec parsing has normalized away any consecutive
+		 * slashes, but a trailing slash might still be present,
+		 * "remove" it.
+		 */
+		if (path[len - 1] == '/')
+			len--;
+
+		pi->modified_path_bloom_hashes_nr = graph->num_modified_path_bloom_hashes;
+		ALLOC_ARRAY(pi->modified_path_bloom_hashes,
+			    pi->modified_path_bloom_hashes_nr);
+
+		compute_modified_path_bloom_hashes_for_path(path, len,
+				graph->num_modified_path_bloom_hashes,
+				pi->modified_path_bloom_hashes);
+	}
+
+	pathspec->can_use_modified_path_bloom_filters = 1;
+}
+
 struct packed_commit_list {
 	struct commit **list;
 	int nr;
diff --git a/commit-graph.h b/commit-graph.h
index 847cd25cfc..09dfc16932 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,8 @@ struct commit;
 struct repository;
 struct raw_object_store;
 struct string_list;
+struct bloom_filter;
+struct pathspec;
 
 char *get_commit_graph_filename(const char *obj_dir);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@ -38,6 +40,12 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
 struct tree *get_commit_tree_in_graph(struct repository *r,
 				      const struct commit *c);
 
+enum bloom_result check_modified_path_bloom_filter(struct repository *r,
+		struct pathspec *pathspec, struct commit *parent,
+		struct commit *commit);
+void init_pathspec_bloom_fields(struct repository *r,
+		struct pathspec *pathspec);
+
 struct commit_graph {
 	int graph_fd;
 
@@ -59,6 +67,11 @@ struct commit_graph {
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
+	const unsigned char *chunk_mpbf_index;
+	uint64_t chunk_mpbf_index_size;
+	const unsigned char *chunk_mpbf_excludes;
+
+	uint8_t num_modified_path_bloom_hashes;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
diff --git a/pathspec.c b/pathspec.c
index 128f27fcb7..01e7ae6944 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -406,6 +406,8 @@ static void init_pathspec_item(struct pathspec_item *item, unsigned flags,
 	item->attr_check = NULL;
 	item->attr_match = NULL;
 	item->attr_match_nr = 0;
+	item->modified_path_bloom_hashes_nr = 0;
+	item->modified_path_bloom_hashes = NULL;
 
 	/* PATHSPEC_LITERAL_PATH ignores magic */
 	if (flags & PATHSPEC_LITERAL_PATH) {
@@ -674,6 +676,12 @@ void copy_pathspec(struct pathspec *dst, const struct pathspec *src)
 		}
 
 		d->attr_check = attr_check_dup(s->attr_check);
+
+		ALLOC_ARRAY(d->modified_path_bloom_hashes,
+			    d->modified_path_bloom_hashes_nr);
+		COPY_ARRAY(d->modified_path_bloom_hashes,
+			   s->modified_path_bloom_hashes,
+			   d->modified_path_bloom_hashes_nr);
 	}
 }
 
@@ -684,6 +692,7 @@ void clear_pathspec(struct pathspec *pathspec)
 	for (i = 0; i < pathspec->nr; i++) {
 		free(pathspec->items[i].match);
 		free(pathspec->items[i].original);
+		free(pathspec->items[i].modified_path_bloom_hashes);
 
 		for (j = 0; j < pathspec->items[i].attr_match_nr; j++)
 			free(pathspec->items[i].attr_match[j].value);
diff --git a/pathspec.h b/pathspec.h
index 454ce364fa..0c661e1b03 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -32,6 +32,7 @@ struct pathspec {
 	unsigned int has_wildcard:1;
 	unsigned int recursive:1;
 	unsigned int recurse_submodules:1;
+	unsigned int can_use_modified_path_bloom_filters:1;
 	unsigned magic;
 	int max_depth;
 	struct pathspec_item {
@@ -52,6 +53,17 @@ struct pathspec {
 			} match_mode;
 		} *attr_match;
 		struct attr_check *attr_check;
+
+		/*
+		 * These fields are only relevant during pathspec-limited
+		 * revision walks, init_pathspec_item() leaves them
+		 * zero-initialized, but copy_pathspec() copies them,
+		 * and clear_pathspec() releases the allocated memory.
+		 * IOW: they are only valid if 'struct pathspec's
+		 * 'can_use_modified_path_bloom_filters' bit is set.
+		 */
+		uint32_t modified_path_bloom_hashes_nr;
+		uint32_t *modified_path_bloom_hashes;
 	} *items;
 };
 
diff --git a/revision.c b/revision.c
index 9dac1262ef..6ec4bc0cb1 100644
--- a/revision.c
+++ b/revision.c
@@ -29,6 +29,7 @@
 #include "prio-queue.h"
 #include "hashmap.h"
 #include "utf8.h"
+#include "bloom-filter.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -629,6 +630,7 @@ static int rev_compare_tree(struct rev_info *revs,
 {
 	struct tree *t1 = get_commit_tree(parent);
 	struct tree *t2 = get_commit_tree(commit);
+	enum bloom_result bloom_ret;
 
 	if (!t1)
 		return REV_TREE_NEW;
@@ -653,6 +655,12 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
+	bloom_ret = check_modified_path_bloom_filter(revs->repo,
+						     &revs->pruning.pathspec,
+						     parent, commit);
+	if (bloom_ret == BLOOM_DEFINITELY_NOT)
+		return REV_TREE_SAME;
+
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
 	diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning);
@@ -662,10 +670,17 @@ static int rev_compare_tree(struct rev_info *revs,
 static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 {
 	struct tree *t1 = get_commit_tree(commit);
+	enum bloom_result bloom_ret;
 
 	if (!t1)
 		return 0;
 
+	bloom_ret = check_modified_path_bloom_filter(revs->repo,
+						     &revs->pruning.pathspec,
+						     NULL, commit);
+	if (bloom_ret == BLOOM_DEFINITELY_NOT)
+		return 1;
+
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
 	diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
@@ -3376,6 +3391,9 @@ int prepare_revision_walk(struct rev_info *revs)
 		simplify_merges(revs);
 	if (revs->children.name)
 		set_children(revs);
+
+	init_pathspec_bloom_fields(revs->repo, &revs->pruning.pathspec);
+
 	return 0;
 }
 

From patchwork Fri May 29 08:50:26 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578169
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1CDC6912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:59 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 0423020723
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:59 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="G/HRUI2M"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727034AbgE2Iv6 (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:58 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44434 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726911AbgE2Ivd (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:33 -0400
Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com
 [IPv6:2a00:1450:4864:20::444])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 51352C08C5C8
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:33 -0700 (PDT)
Received: by mail-wr1-x444.google.com with SMTP id q11so2592764wrp.3
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:33 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=xq9RPwStefHvCymyxByn3fvYLabVFMEidu2zASdQcu0=;
        b=G/HRUI2MG3qzrPCZ1Iuo6DiUplG9ImjappnenKNxJQfkI1+B41wC7lBmF2QrJQgUbJ
         acwr0ofWGGUyYSLa/n2vVRf2x8byKu3MjuFy0xY/FOpnwHBuhQDgFixqLmtm/A/9MHIJ
         AM/noY/f41IAsrjABoF2VVi5zTSBuDgf0r8zmYm6G2aYoqDDrRH0UjctvAXlue3bSjX3
         xdmVJt3lBGVYbjAZgRzcXcFWxUJUgRost3t/IRHrNY0PmldcgG5NgOdSVsx5eYXMk4Ib
         octeYg7bz+etr8Po0w8PIrd+e9v/qf0gX0t4ZifAKmplSIyxK9qp+NgukM02ytLhozgr
         Irjw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=xq9RPwStefHvCymyxByn3fvYLabVFMEidu2zASdQcu0=;
        b=AVjP1JrnCBIbQ/p63eibS6WRSnVWr08tTO9a/1aqGmqM8/rt0AhX5jN3L7SVbm4ooM
         sSELzHUUg3g+GNW43FePCnWjFnWND5vRoMqsApSYRSm4jdtkXZjGD9jPe0eIjLtRviVq
         0LhccIgZDB/URGhBrS2decRC4PfTo9wK/PekxNqQYJj9doYy7Bf68tu28ZCo5N5jdcqa
         0vuWDacAOk3OUEbcYpBTkEisuLwOClbxWgin9dH/Gd/Q4ULSy3U9AKqEY8ysJHnWaSmW
         HebgY22cdy6eSHUf9RT9v4OyZDLlkjJcoJPlTGAQcFBO5YOfwespiJsdTyW8l67ndn3Z
         oC9A==
X-Gm-Message-State: AOAM532ytW/ZQoKAlQnSHMsk1Qu9p0pMLNNQH+pnWB7wRZ75sUnpiTNL
        h61ejoKjbAVXyhnuyad+SdkjGuju
X-Google-Smtp-Source: 
 ABdhPJyZBo+zX3d+JS5eH5Iv7771KLHseOIarzbTIe83Lg85hPLFiwIPooN/wgZsIo7fxDbvY5a1GQ==
X-Received: by 2002:adf:8023:: with SMTP id 32mr8118751wrk.247.1590742291682;
        Fri, 29 May 2020 01:51:31 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.30
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:30 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters
 chunk
Date: Fri, 29 May 2020 10:50:26 +0200
Message-Id: <20200529085038.26008-23-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Write the Modified Path Bloom Filters chunk first, keeping track of
the offsets where each (non-embedded) Bloom filter has been written,
and then write the Modified Path Bloom Filter Index chunk using those
recorded offsets.
---
 commit-graph.c | 78 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 69 insertions(+), 9 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fb24600bb3..3210ec2f93 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -26,6 +26,7 @@
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */
+#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS 0x4d504246 /* "MPBF" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES 0x4d504258 /* "MPBX" */
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -56,6 +57,12 @@
 
 struct modified_path_bloom_filter_info {
 	struct bloom_filter filter;
+	/*
+	 * The offset relative to the start of the Modified Path Bloom
+	 * Filters chunk where this Bloom filter has been written,
+	 * -1 before that.
+	 */
+	uint64_t offset;
 };
 
 static void free_modified_path_bloom_filter_info_in_slab(
@@ -1019,6 +1026,9 @@ struct write_commit_graph_context {
 		 */
 		uint32_t *hashes;
 		int hashes_nr, hashes_alloc;
+
+		/* Excluding embedded modified path Bloom filters */
+		uint64_t total_filter_size;
 	} mpbfctx;
 };
 
@@ -1219,6 +1229,43 @@ static int write_graph_chunk_extra_edges(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
+		struct write_commit_graph_context *ctx)
+{
+	uint64_t offset = 0;
+	int i;
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *commit = ctx->commits.list[i];
+		struct modified_path_bloom_filter_info *bfi;
+		unsigned int filter_size;
+
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		bfi = modified_path_bloom_filters_peek(
+				&modified_path_bloom_filters, commit);
+
+		if (!bfi || !bfi->filter.nr_bits)
+			continue;
+		if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS)
+			continue;
+
+		if (offset >> 62)
+			BUG("offset %lu is too large for the Modified Path Bloom Filter Index chunk",
+			    offset);
+
+		bfi->offset = offset;
+
+		filter_size = bloom_filter_bytes(&bfi->filter);
+
+		hashwrite_be32(f, bfi->filter.nr_bits);
+		hashwrite(f, bfi->filter.bits, filter_size);
+
+		offset += sizeof(uint32_t) + filter_size;
+	}
+	return 0;
+}
+
 static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 		struct write_commit_graph_context *ctx)
 {
@@ -1247,8 +1294,11 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 			 */
 			filterdata[0] |= 1 << 7;
 			hashwrite(f, filterdata, sizeof(filterdata));
+		} else if (bfi->offset != -1) {
+			uint64_t offset = htonll(bfi->offset);
+			hashwrite(f, &offset, sizeof(offset));
 		} else
-			BUG("writing non-embedded Bloom filters is not implemented yet");
+			BUG("modified path Bloom filter offset is still -1?!");
 	}
 	return 0;
 }
@@ -1424,17 +1474,20 @@ static void create_modified_path_bloom_filter(
 	diff_tree_oid(parent_oid, &commit->object.oid, "", &mpbfctx->diffopt);
 	path_component_count = mpbfctx->hashes_nr / mpbfctx->num_hashes;
 
+	bfi = modified_path_bloom_filters_at(&modified_path_bloom_filters,
+					     commit);
+	bfi->offset = -1;
 	if (path_component_count > mpbfctx->embedded_limit) {
-		/* Not implemented yet. */
-	} else {
-		bfi = modified_path_bloom_filters_at(
-				&modified_path_bloom_filters, commit);
-
+		bloom_filter_init(&bfi->filter, mpbfctx->num_hashes,
+				  path_component_count);
+		mpbfctx->total_filter_size += sizeof(uint32_t) +
+					      bloom_filter_bytes(&bfi->filter);
+	} else
 		bloom_filter_init_with_size(&bfi->filter,
 				GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS);
-		bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
-				      mpbfctx->hashes_nr);
-	}
+
+	bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
+			      mpbfctx->hashes_nr);
 }
 
 static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
@@ -1845,6 +1898,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunks_nr++;
 	}
 	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
+		if (ctx->mpbfctx.total_filter_size) {
+			ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+			chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS;
+			chunks[chunks_nr].size = ctx->mpbfctx.total_filter_size;
+			chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_filters;
+			chunks_nr++;
+		}
 		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 		chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX;
 		chunks[chunks_nr].size = sizeof(uint8_t) +

From patchwork Fri May 29 08:50:27 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578167
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 86881912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:57 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 695BB207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:57 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="u0gSeoIf"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727025AbgE2Iv4 (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:56 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44438 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726954AbgE2Ivf (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:35 -0400
Received: from mail-wm1-x344.google.com (mail-wm1-x344.google.com
 [IPv6:2a00:1450:4864:20::344])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CB554C08C5C9
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:34 -0700 (PDT)
Received: by mail-wm1-x344.google.com with SMTP id c71so2286943wmd.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=E25T6WKYfK/7sdSOmyHdVl5+9CRJNLnuNQ7Z36H+LLc=;
        b=u0gSeoIfcWmdkYQS1Mq+I01/iHFmWEf30EdZIHk+lEx4fuRzdDvYoQgi+DxV5SjBC7
         /Qb4jYithl2O6Aid6urD0x3cSscEIUprkuOVVPoSWQRjqCfkafA0HQNdaKoDwtF+BOZb
         5mVVzmLDhDzONRfxSIwyJiS/KctNm+tzqv2iJqYNLhp2fM284LKT7RnAbN49B6hC2grx
         4DEENCAw+PyfKmKHYaaZ9t7lF3Vlwtr20u8HHQlQI+y/N56lCYHXAc+BVosqiHsO1Gl0
         ZDzKf8fnfMawLORjHn0ORW05v6va9FsSE82vWWXfrugjDGlb9eg9fuhPxw/7DmZ1wi18
         kHYg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=E25T6WKYfK/7sdSOmyHdVl5+9CRJNLnuNQ7Z36H+LLc=;
        b=NLLJ7k8PXAvCR1FE31/j+bpZalB9YysDbUyjXwKY/DONHSK9Tccqd+jfAKSCLf/rQy
         P50XXXvcf1Xb+OZ5vGHBFIn+2FfDtV9BauL+iIisHp81JEe8fENtsrWDgZX16XtbWlsk
         MqLL1taUyOwemP8qYFwk+97L+AM7W7YhbPIyjGcoOTHkfHGxif2rdR4qTsT9wScUYD25
         vLHbQHrg9VWiZL4onaVZArFV0lmPYmjxxWxexeVux607rYADzu2eUElGKlbL41TQpzO6
         4Fk8gm9T4egcjynTHAGYPQ4u5kXct0+gWfQPkKTYzCPLEAEWOyEnFPxorRfWxijQQ02t
         Q6tQ==
X-Gm-Message-State: AOAM530a8emzs7jAtsmTybdw+EgbqmAxLyuoBEHmX5+b42figEABlxOm
        VACWE0yaE42ghINFFaLrZqfDu7NM
X-Google-Smtp-Source: 
 ABdhPJwZccDYZQ9BoRH5DC0i1J4+AgGqP4BqG8ZqJHeJYUERh8v9WgHXWErrozRrh76iMHIDA+57Gw==
X-Received: by 2002:a7b:c353:: with SMTP id l19mr7758614wmj.187.1590742293168;
        Fri, 29 May 2020 01:51:33 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.32
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:32 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 23/34] commit-graph: load and use the Modified Path Bloom
 Filters chunk
Date: Fri, 29 May 2020 10:50:27 +0200
Message-Id: <20200529085038.26008-24-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

The table below compares the average runtime and memory usage of

  git rev-list HEAD -- "$path"

with and without modified path Bloom filters for 5000+ randomly
selected paths from each repository:

                  Average runtime            |       Max RSS
                 without     with   Speedup  |  without   with
  -------------------------------------------+-------------------------
  android-base   0.8780s   0.1742s    5.04x  |   387MB   227MB   -41.4%
  cmssw          0.3143s   0.0452s    6.95x  |   181MB    79MB   -56.4%
  cpython        0.7453s   0.0956s    7.80x  |   159MB    71MB   -55.4%
  elasticsearch  0.1492s   0.0191s    7.82x  |   134MB    53MB   -60.2%
  gcc            7.1852s   0.3584s   20.05x  |   297MB   231MB   -22.3%
  gecko-dev      4.6113s   0.6318s    7.30x  |   832MB   600MB   -27.9%
  git            0.6180s   0.0405s   15.26x  |   131MB    43MB   -67.2%
  glibc          0.5618s   0.0471s   11.93x  |   136MB    50MB   -63.3%
  go             0.4913s   0.0515s    9.53x  |   130MB    50MB   -61.5%
  jdk            0.0482s   0.0089s    5.42x  |    52MB    39MB   -25.0%
  linux          0.7043s   0.1129s    6.24x  |   438MB   270MB   -38.4%
  llvm-project   2.6844s   0.4873s    5.51x  |   384MB   282MB   -26.6%
  rails          0.2784s   0.0484s    5.75x  |    88MB    51MB   -42.1%
  rust           0.7757s   0.0619s   12.53x  |   345MB    89MB   -74.3%
  tensorflow     0.6258s   0.0642s    9.74x  |   233MB    85MB   -63.5%
  webkit         1.9137s   0.3332s    5.74x  |   941MB   480MB   -49.0%
---
 commit-graph.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 commit-graph.h |  2 ++
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3210ec2f93..f9a21ecdfb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -327,6 +327,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			}
 			break;
 
+		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS:
+			if (graph->chunk_mpbf_filters)
+				chunk_repeated = 1;
+			else {
+				graph->chunk_mpbf_filters = data + chunk_offset;
+				graph->chunk_mpbf_filters_size = next_chunk_offset - chunk_offset;
+			}
+			break;
+
 		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES:
 			if (graph->chunk_mpbf_excludes)
 				chunk_repeated = 1;
@@ -830,6 +839,7 @@ static int load_modified_path_bloom_filter_from_graph(
 		struct commit *parent, struct bloom_filter *bf)
 {
 	const uint8_t *bloom_index;
+	uint64_t offset;
 	int first_parent = 0;
 
 	if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
@@ -857,9 +867,39 @@ static int load_modified_path_bloom_filter_from_graph(
 		bf->nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS;
 		bf->bits = (uint8_t*) bloom_index;
 		return 1;
+	} else if (bloom_index[0] & (1 << 6)) {
+		/*
+		 * Modified path Bloom filters for second..nth parents of
+		 * merge commits are not implemented yet.
+		 */
+		return 0;
+	} else {
+		if (!first_parent)
+			return 0;
+		offset = get_be64(bloom_index);
 	}
-	/* support for non-embedded Bloom filters is not implemented yet. */
-	return 0;
+
+	if (!graph->chunk_mpbf_filters)
+		BUG("commit %s refers to offset %lu of the Modified Path Bloom Filters chunk, but that chunk is missing",
+		    oid_to_hex(&commit->object.oid), offset);
+
+	if (offset + sizeof(uint32_t) >= graph->chunk_mpbf_filters_size)
+		BUG("commit %s refers to offset %lu of the Modified Path Bloom Filters chunk, but that's too large for chunk of size %lu bytes",
+		    oid_to_hex(&commit->object.oid), offset,
+		    graph->chunk_mpbf_filters_size);
+
+	bf->nr_bits = get_be32(graph->chunk_mpbf_filters + offset);
+	if (!bf->nr_bits)
+		BUG("commit %s has a modified path Bloom filter at offset %lu, which has zero size",
+		    oid_to_hex(&commit->object.oid), offset);
+	if (offset + sizeof(uint32_t) + bloom_filter_bytes(bf) > graph->chunk_mpbf_filters_size)
+		BUG("commit %s has a modified path Bloom filter of %u bits at offset %lu, which doesn't fit into a Modified Path Bloom Filters chunk of %lu bytes",
+		    oid_to_hex(&commit->object.oid), bf->nr_bits, offset,
+		    graph->chunk_mpbf_filters_size);
+	/* Casting away const-ness :( */
+	bf->bits = (uint8_t*)(graph->chunk_mpbf_filters + offset + sizeof(uint32_t));
+
+	return 1;
 }
 
 enum bloom_result check_modified_path_bloom_filter(struct repository *r,
diff --git a/commit-graph.h b/commit-graph.h
index 09dfc16932..cde0d7fa30 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -69,6 +69,8 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_mpbf_index;
 	uint64_t chunk_mpbf_index_size;
+	const unsigned char *chunk_mpbf_filters;
+	uint64_t chunk_mpbf_filters_size;
 	const unsigned char *chunk_mpbf_excludes;
 
 	uint8_t num_modified_path_bloom_hashes;

From patchwork Fri May 29 08:50:28 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578165
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9F5F4739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:55 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 85E1320723
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:55 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="X3uBlyDd"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727024AbgE2Ivy (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:54 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44446 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726310AbgE2Ivh (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:37 -0400
Received: from mail-wr1-x441.google.com (mail-wr1-x441.google.com
 [IPv6:2a00:1450:4864:20::441])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3B060C08C5CA
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:36 -0700 (PDT)
Received: by mail-wr1-x441.google.com with SMTP id x6so2472289wrm.13
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=DvIgrHUiH2ABGw/jGLUuC8AlH0nKKG7Z7r1VC6CPdHE=;
        b=X3uBlyDdKP7USIdPN3RGAllPgiQoOlswdxApvHlxb/4yp90tDdj534cLU1R62MaI9M
         5I1/U9a+xnr5RegGjvbGbkvEeYumOaww4GdKbsq9x3U17LzVwEKVoRGmz9ZEIb6i9NPw
         FPQq1Y1kt5ywQrgIrRhuUzUkdnGy10jBau0a3xMQJn7QY8C91KwP5YQPPKGDRR9zB1gs
         Ygxlq2NgnnSGnTPvYwkqUgtO8RO+k3O0q5AlAlzNiHyVCZTnifj3oLdVsHmc9rHpt50l
         2qSFRuZogkciaSCmbSe3j7ySEq5KwuIOamomEO7+aW3faaGshm/xNvZ7c6nwyHkRidAE
         czzQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=DvIgrHUiH2ABGw/jGLUuC8AlH0nKKG7Z7r1VC6CPdHE=;
        b=aocnErldmbelnnNWWh24xlsyJqHDky4cP/zVEGi3FOFITg+wpowps/j0HqIPa36N5O
         f2B/mYwTfCV788nG4fTMpz6bGMxo0IB8A+aq2CAR8QguwEGb8G8/sPsjZpXlkM0MXJZ0
         Zqoz37pA51U7ikqMpF/x54OH5ythfFV+kRAmjPxuEn/gkXXD1j1z+wvgh2wYurBaljiY
         o/wmnHmv4Av/3xQmnmRyXwiDhwvylR/do2pNAeeyWtaCkRCP4xYSZN3+k2d48tFYsy+P
         c99Nt/Kpix4LFNOZemab2UZw4S09XzE/QKtj3zJVW3ABE6p621SZVBSqiyQGCW1v+kuB
         1Jmw==
X-Gm-Message-State: AOAM5334japVeQE5dPam4S1wA3mTtWnU9aNtmMZhCF4Y5QWyoOV4coxt
        TNt2rK3rJZ/eZxnrDG3UE7JDY0iy
X-Google-Smtp-Source: 
 ABdhPJzbMFN9uaRLHPf2wcGaNUDYyVbuqZ8GzO14XIb6SvR+tO1erdSUXqAT7LEQi83k6CirgCruVQ==
X-Received: by 2002:a5d:4d0d:: with SMTP id z13mr8179942wrt.220.1590742294611;
        Fri, 29 May 2020 01:51:34 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.33
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:34 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 24/34] commit-graph: check all leading directories in modified
 path Bloom filters
Date: Fri, 29 May 2020 10:50:28 +0200
Message-Id: <20200529085038.26008-25-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well.  Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git -c core.modifiedPathBloomFilters=1 rev-list HEAD -- "$path"

before and after this change for 5000+ randomly selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  android-base    0.526%   0.00416%    0.1727s   0.1503s   -13.0%
  cmssw           0.494%   0.00395%    0.0426s   0.0332s   -22.0%
  cpython         0.213%   0.01661%    0.0940s   0.0858s    -8.8%
  elasticsearch   0.679%   0.00325%    0.0191s   0.0147s   -23.2%
  gcc             0.398%   0.00892%    0.3423s   0.2192s   -36.0%
  gecko-dev       0.472%   0.00073%    0.6243s   0.5134s   -17.8%
  git             0.191%   0.06992%    0.0384s   0.0319s   -17.1%
  glibc           0.392%   0.01639%    0.0425s   0.0296s   -30.5%
  go              0.453%   0.01262%    0.0515s   0.0425s   -17.5%
  jdk             0.662%   0.00643%    0.0083s   0.0070s   -15.0%
  linux           0.434%   0.00749%    0.1102s   0.0911s   -17.3%
  llvm-project    0.511%   0.00391%    0.4865s   0.4290s   -11.8%
  rails           0.476%   0.01313%    0.0464s   0.0410s   -11.6%
  rust            0.457%   0.02551%    0.0590s   0.0462s   -21.7%
  tensorflow      0.511%   0.00824%    0.0642s   0.0517s   -19.5%
  webkit          0.661%   0.00101%    0.3315s   0.2576s   -22.3%

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f9a21ecdfb..1fd1b4f8dd 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -977,8 +977,10 @@ void init_pathspec_bloom_fields(struct repository *r,
 
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item *pi = &pathspec->items[i];
-		const char *path = pi->match;
+		const char *path = pi->match, *p;
 		size_t len = pi->len;
+		int path_component_nr = 0, j;
+		uint32_t *hashes;
 
 		/*
 		 * Pathspec parsing has normalized away any consecutive
@@ -988,13 +990,28 @@ void init_pathspec_bloom_fields(struct repository *r,
 		if (path[len - 1] == '/')
 			len--;
 
-		pi->modified_path_bloom_hashes_nr = graph->num_modified_path_bloom_hashes;
+		p = path;
+		do {
+			p = strchrnul(p + 1, '/');
+			path_component_nr++;
+		} while (p - path < len);
+
+		pi->modified_path_bloom_hashes_nr = path_component_nr * graph->num_modified_path_bloom_hashes;
 		ALLOC_ARRAY(pi->modified_path_bloom_hashes,
 			    pi->modified_path_bloom_hashes_nr);
 
-		compute_modified_path_bloom_hashes_for_path(path, len,
-				graph->num_modified_path_bloom_hashes,
-				pi->modified_path_bloom_hashes);
+		p = path;
+		hashes = pi->modified_path_bloom_hashes +
+			 pi->modified_path_bloom_hashes_nr;
+		for (j = 0; j < path_component_nr; j++) {
+			p = strchrnul(p + 1, '/');
+
+			hashes -= graph->num_modified_path_bloom_hashes;
+			compute_modified_path_bloom_hashes_for_path(path,
+					p - path,
+					graph->num_modified_path_bloom_hashes,
+					hashes);
+		}
 	}
 
 	pathspec->can_use_modified_path_bloom_filters = 1;

From patchwork Fri May 29 08:50:29 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578163
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0096A912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:54 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id D9310207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:53 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="hwgA9ZTI"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727023AbgE2Ivw (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:52 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44458 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726955AbgE2Ivk (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:40 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C043AC08C5CB
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:37 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id x14so2605197wrp.2
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:37 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=M1jiwZwpSev+ojVUCt2eBacB1NNB5MBIy1HzFiFF75s=;
        b=hwgA9ZTIAtWlqPbG0maEw5yQ2gK2qtW0Vx7MHmXemOvtrsMdR/dSwpHKNL9Fo7BEty
         GwJPZkFHDmtaKXjGxFRTZGCh3zStrh00zYn22ylBhpit4z+kwlLY6GbhvbZn/sosPaJG
         8DH2MVvd0HS4mECEKQEp/3fmQBlDSsTfi1s8LRq+2o3FVBwauD8R9tK2tmHQrmFO3eg7
         jqHuebxKRpa0kczstSfsVWQq2tz1S47Ype5EC6KtH9I9WqXo4CbBQzSNRK8Zgzh8r7wo
         ITh6bAg0kd+LTF4ouHr49SIwtijO6Z5d0kyGLO+nc+sYND2mjg+2zSDO+YG5qhOr5+T7
         oUbA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=M1jiwZwpSev+ojVUCt2eBacB1NNB5MBIy1HzFiFF75s=;
        b=rtnCSOCWTbxmM27EmEqp85rOu9aZKFp3miRIycBRvIW7nO4QrdPSMIbmivvvU3qSaH
         zqSjekhDSITcX0h3GbqJT+dRNhmShu9ChK0AjeL8qLXq85Gm2kzlmC5OqfYlmOLwEPHw
         BrcHlWl7ynxnb7pOxsyhgVtLOlYsVY5X0DFHNmzSHr6lInaHpb0/iYlIL2iH0xn8D5UH
         l/af6lI6J+JaVK4q7etF+G6XCPqOM0WtLi037GIvzpbykkLgHHe6DDPe9Hl5cnhRZ/53
         o3fXhkI6Ztyiy+VQIX3HfiRH2qPwIpoQcZX2jueiNOIW8cIsSILp2JJ801Wp4wf4n/Xu
         WYYQ==
X-Gm-Message-State: AOAM532vypLisJ9dxTYr+UiwwBp4hrzsR7U0k9pKkhPBJ5Ju/Qnmo20I
        FESDy+vsaMtcNDJTaCnQcGvh/3V1
X-Google-Smtp-Source: 
 ABdhPJzP2hVeBT/3MfuTWJjGaPqlthoRmbLTqZPOG0cVCb6iAOrSPI/med1403RHTMCmvrnDjfQyDQ==
X-Received: by 2002:a5d:6391:: with SMTP id p17mr8222809wru.118.1590742296164;
        Fri, 29 May 2020 01:51:36 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.34
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:35 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 25/34] commit-graph: check embedded modified path Bloom
 filters with a mask
Date: Fri, 29 May 2020 10:50:29 +0200
Message-Id: <20200529085038.26008-26-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Since the previous patch we check the presence of all leading
directories of a pathspec in modified path Bloom filters to
significantly lower the false positive rate.  This means that we are
checking a lot of bit positions in the Bloom filters.  However, as
shown earlier in this series, a significant portion of commits have
embedded modified path Bloom filters, all with the same size of 63
bits.

Use a 64 bit mask to check all relevant bits in embedded modified path
Bloom filters in one step.

I don't have benchmark results at hand, but as far as I can recall
some old results this reduces the time spent loading and checking
modified path Bloom filters by up to 10%.  Checking Bloom filters is
of course only a small part of the whole runtime of a pathspec-limited
revision walk, so the overall improvement is only about 1-2%.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 17 +++++++++++++++--
 pathspec.c     |  1 +
 pathspec.h     |  1 +
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1fd1b4f8dd..56b8f33d10 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -918,10 +918,16 @@ enum bloom_result check_modified_path_bloom_filter(struct repository *r,
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item *pi = &pathspec->items[i];
 
-		if (bloom_filter_check_bits(&bf,
+		if (bf.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
+			/* Gross!  And potential alignment issues?! */
+			if ((*((const uint64_t*)bf.bits) & pi->modified_path_bloom_mask) == pi->modified_path_bloom_mask)
+				return BLOOM_POSSIBLY_YES;
+		} else {
+			if (bloom_filter_check_bits(&bf,
 					pi->modified_path_bloom_hashes,
 					pi->modified_path_bloom_hashes_nr))
-			return BLOOM_POSSIBLY_YES;
+				return BLOOM_POSSIBLY_YES;
+		}
 	}
 
 	return BLOOM_DEFINITELY_NOT;
@@ -981,6 +987,7 @@ void init_pathspec_bloom_fields(struct repository *r,
 		size_t len = pi->len;
 		int path_component_nr = 0, j;
 		uint32_t *hashes;
+		struct bloom_filter embedded_bf;
 
 		/*
 		 * Pathspec parsing has normalized away any consecutive
@@ -1012,6 +1019,12 @@ void init_pathspec_bloom_fields(struct repository *r,
 					graph->num_modified_path_bloom_hashes,
 					hashes);
 		}
+
+		embedded_bf.nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS;
+		embedded_bf.bits = (uint8_t*) &pi->modified_path_bloom_mask;
+		bloom_filter_set_bits(&embedded_bf,
+				      pi->modified_path_bloom_hashes,
+				      pi->modified_path_bloom_hashes_nr);
 	}
 
 	pathspec->can_use_modified_path_bloom_filters = 1;
diff --git a/pathspec.c b/pathspec.c
index 01e7ae6944..29cadce013 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -408,6 +408,7 @@ static void init_pathspec_item(struct pathspec_item *item, unsigned flags,
 	item->attr_match_nr = 0;
 	item->modified_path_bloom_hashes_nr = 0;
 	item->modified_path_bloom_hashes = NULL;
+	item->modified_path_bloom_mask = 0;
 
 	/* PATHSPEC_LITERAL_PATH ignores magic */
 	if (flags & PATHSPEC_LITERAL_PATH) {
diff --git a/pathspec.h b/pathspec.h
index 0c661e1b03..c9975aea7a 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -64,6 +64,7 @@ struct pathspec {
 		 */
 		uint32_t modified_path_bloom_hashes_nr;
 		uint32_t *modified_path_bloom_hashes;
+		uint64_t modified_path_bloom_mask;
 	} *items;
 };
 

From patchwork Fri May 29 08:50:30 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578161
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0B3F3912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:53 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id E209E207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:52 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="mnyZ0eun"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727022AbgE2Ivw (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:52 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44460 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726957AbgE2Ivk (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:40 -0400
Received: from mail-wr1-x443.google.com (mail-wr1-x443.google.com
 [IPv6:2a00:1450:4864:20::443])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7AC25C08C5D1
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:39 -0700 (PDT)
Received: by mail-wr1-x443.google.com with SMTP id r7so2613132wro.1
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:39 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=/A29Mh3EQ2rOKDAMMIAdW+Wb9CGT8wULU5GEW01rZdA=;
        b=mnyZ0eunFnpNpHALPcyMyeAd2tfjL9JaGQr1q67o1n2WXczmamjvERxX2cR34WlgJh
         Lja7LYoUZe7PMEIyw7PdEPJYWCrJw+QnMH31ED+Sa+jggigRmdGAzLX6HrfIrD8wrKFB
         CJ4hUhhZuIaqzqqw8+uWJRzwxgIBzdYbuSQPVibz50ZH/x1dkILZbLis1nv36X5z1D6r
         vod3NxU/up7rfZlF+LemF+iv/eTIbGCre4ZdXlFueQWWYx8NXbgR9mq3CQwUKnMDYhLw
         m+Iq8ApuDTRzbGfj+wNkqpYJ8wt5U5mhcBdT+1wbK7q8tkBGiiaqTNaoH6pHvITVAVns
         lbGQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=/A29Mh3EQ2rOKDAMMIAdW+Wb9CGT8wULU5GEW01rZdA=;
        b=eTpovHQfvqDHb7+Zuc8wMQk4RGJsRlgJfv2TgueXko5xvDBb0mPVb9zTdyzTHVS0+g
         mov+4PWJIhEEWBsaEIO0clnrS1KqEcpGN2Oz7+w62SxuKAUL/QrKTeYUao7yViBix6Tq
         ggD1s9h3cCe8wmp5q285hzDVaMhqsT6TH/05fw7IowLFPk/eTu8OQ2e4/FrmqZJNEh/T
         lf2XQtt/hE19TJjy/hx4DH68kQRnRHmZluMTDcETJlVSk9pRw6aBDlrfVuSBC/Y3uNhj
         MxqAdytly9a+0XhXYFj+1hsY8kvp+Z+NafP79JmQ2OI48IQE1PH0qVK/tewWf7dRI9XN
         JzLA==
X-Gm-Message-State: AOAM533ILLq6uuFJw217IjTh5z7aeORKEERnYH//pEZAJWLJsNU30OpD
        XwSJbd6siZtxjt8c3cI65beD5/KB
X-Google-Smtp-Source: 
 ABdhPJzm+YVWWBYH4C+ImJJfvWgg4A3g1s1b8Rx6w8m44jaL+PC0I3sNcnKYclitAIwc/CMGLz8wvQ==
X-Received: by 2002:a5d:500d:: with SMTP id e13mr8147685wrt.150.1590742297848;
        Fri, 29 May 2020 01:51:37 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.36
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:37 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters
Date: Fri, 29 May 2020 10:50:30 +0200
Message-Id: <20200529085038.26008-27-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Some commits modify the same set of paths, and, consequently, have
identical modified path Bloom filters.

Use a hashmap to find these identical Bloom filters, and omit all
duplicates from the Modified Path Bloom Filters chunk, reducing its
size.

                   MPBF chunk size
                 without       with             Time spent
                    deduplication                deduping
  --------------------------------------------------------
  android-base   8620198     2618764    -69.6%    0.089s
  cmssw         10347018     9094468    -12.1%    0.056s
  cpython         526381      371629    -29.4%    0.013s
  elasticsearch  4733274     3607106    -23.8%    0.029s
  gcc            3724544     3394521     -8.9%    0.072s
  gecko-dev     20591010    17042732    -17.2%    0.235s
  git             160718      139546    -13.2%    0.004s
  glibc           759132      699623     -7.8%    0.009s
  go              458375      421394     -8.1%    0.008s
  jdk           13213208    12158533     -8.0%    0.034s
  linux         26575278    25029700     -5.8%    0.169s
  llvm-project   3988133     3269438    -18.0%    0.085s
  rails           958691      614528    -35.9%    0.015s
  rust           1985782     1682919    -15.3%    0.023s
  tensorflow     4173570     3789768     -9.2%    0.022s
  webkit         5891751     5394507     -8.4%    0.096s

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 56b8f33d10..178bbf0113 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -57,6 +57,7 @@
 
 struct modified_path_bloom_filter_info {
 	struct bloom_filter filter;
+	struct modified_path_bloom_filter_info *duplicate_of;
 	/*
 	 * The offset relative to the start of the Modified Path Bloom
 	 * Filters chunk where this Bloom filter has been written,
@@ -1099,6 +1100,9 @@ struct write_commit_graph_context {
 
 		/* Excluding embedded modified path Bloom filters */
 		uint64_t total_filter_size;
+
+		/* Used to find identical modified path Bloom filters */
+		struct hashmap dedup_hashmap;
 	} mpbfctx;
 };
 
@@ -1315,7 +1319,11 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
 
-		if (!bfi || !bfi->filter.nr_bits)
+		if (!bfi)
+			continue;
+		if (bfi->duplicate_of)
+			continue;
+		if (!bfi->filter.nr_bits)
 			continue;
 		if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS)
 			continue;
@@ -1364,6 +1372,9 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 			 */
 			filterdata[0] |= 1 << 7;
 			hashwrite(f, filterdata, sizeof(filterdata));
+		} else if (bfi->duplicate_of) {
+			uint64_t offset = htonll(bfi->duplicate_of->offset);
+			hashwrite(f, &offset, sizeof(offset));
 		} else if (bfi->offset != -1) {
 			uint64_t offset = htonll(bfi->offset);
 			hashwrite(f, &offset, sizeof(offset));
@@ -1515,6 +1526,53 @@ static void file_change_cb(struct diff_options *options,
 	handle_modified_file(options->change_fn_data, fullpath);
 }
 
+struct modified_path_bloom_filter_dedup_entry {
+	struct hashmap_entry entry;
+	struct modified_path_bloom_filter_info *bfi;
+};
+
+static int modified_path_bloom_filter_dedup_cmp(const void *unused_cmp_data,
+		const struct hashmap_entry *he1,
+		const struct hashmap_entry *he2,
+		const void *unused_keydata)
+{
+	const struct modified_path_bloom_filter_dedup_entry *a, *b;
+
+	a = container_of(he1, const struct modified_path_bloom_filter_dedup_entry, entry);
+	b = container_of(he2, const struct modified_path_bloom_filter_dedup_entry, entry);
+
+	if (a->bfi->filter.nr_bits != b->bfi->filter.nr_bits)
+		return 1;
+
+	return memcmp(a->bfi->filter.bits, b->bfi->filter.bits,
+		      bloom_filter_bytes(&a->bfi->filter));
+}
+
+static int handle_duplicate_modified_path_bloom_filter(
+		struct modified_path_bloom_filter_context *mpbfctx,
+		struct modified_path_bloom_filter_info *bfi)
+{
+	struct modified_path_bloom_filter_dedup_entry *e, stack_entry;
+	unsigned int h;
+
+	h = memhash(bfi->filter.bits, bloom_filter_bytes(&bfi->filter));
+	hashmap_entry_init(&stack_entry.entry, h);
+	stack_entry.bfi = bfi;
+
+	e = hashmap_get_entry(&mpbfctx->dedup_hashmap, &stack_entry, entry,
+			      NULL);
+	if (e) {
+		bfi->duplicate_of = e->bfi;
+		return 1;
+	} else {
+		e = xmalloc(sizeof(*e));
+		hashmap_entry_init(&e->entry, h);
+		e->bfi = bfi;
+		hashmap_add(&mpbfctx->dedup_hashmap, &e->entry);
+		return 0;
+	}
+}
+
 static void create_modified_path_bloom_filter(
 		struct modified_path_bloom_filter_context *mpbfctx,
 		struct commit *commit)
@@ -1547,17 +1605,20 @@ static void create_modified_path_bloom_filter(
 	bfi = modified_path_bloom_filters_at(&modified_path_bloom_filters,
 					     commit);
 	bfi->offset = -1;
-	if (path_component_count > mpbfctx->embedded_limit) {
+	if (path_component_count > mpbfctx->embedded_limit)
 		bloom_filter_init(&bfi->filter, mpbfctx->num_hashes,
 				  path_component_count);
-		mpbfctx->total_filter_size += sizeof(uint32_t) +
-					      bloom_filter_bytes(&bfi->filter);
-	} else
+	else
 		bloom_filter_init_with_size(&bfi->filter,
 				GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS);
 
 	bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
 			      mpbfctx->hashes_nr);
+
+	if (path_component_count > mpbfctx->embedded_limit &&
+	    !handle_duplicate_modified_path_bloom_filter(mpbfctx, bfi))
+		mpbfctx->total_filter_size += sizeof(uint32_t) +
+					      bloom_filter_bytes(&bfi->filter);
 }
 
 static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
@@ -2396,6 +2457,8 @@ int write_commit_graph(const char *obj_dir,
 		diff_setup_done(&ctx->mpbfctx.diffopt);
 
 		strbuf_init(&ctx->mpbfctx.prev_path, PATH_MAX);
+		hashmap_init(&ctx->mpbfctx.dedup_hashmap,
+			     modified_path_bloom_filter_dedup_cmp, NULL, 0);
 	}
 
 	if (ctx->split) {
@@ -2525,6 +2588,9 @@ int write_commit_graph(const char *obj_dir,
 		strbuf_release(&ctx->mpbfctx.prev_path);
 		free(ctx->mpbfctx.hashed_prefix_lengths);
 		free(ctx->mpbfctx.hashes);
+		hashmap_free_entries(&ctx->mpbfctx.dedup_hashmap,
+				struct modified_path_bloom_filter_dedup_entry,
+				entry);
 	}
 
 	free(ctx);

From patchwork Fri May 29 08:50:31 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578151
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 68CC6739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:46 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 4F8FA207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:46 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="Z5HVuZsK"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726988AbgE2Ivp (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:45 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44462 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726961AbgE2Ivl (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:41 -0400
Received: from mail-wr1-x442.google.com (mail-wr1-x442.google.com
 [IPv6:2a00:1450:4864:20::442])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AD023C0A88B4
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:40 -0700 (PDT)
Received: by mail-wr1-x442.google.com with SMTP id e1so2573360wrt.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=hGp0WjsQ6+2XetZJcojt2bGV/K33EZG/AxxU2XagIpw=;
        b=Z5HVuZsKBId+ATSVfZzvwhMpVjtS3iu1LNyugtRHhhvn9Eo1lETIY7JR7sc2GAfEfE
         rUPErqRfQ6dQ4AAiZYMOClNzqv7ypb7A/lW0Lg5qOCL84U+hFAfgiL8CiDL9OrIVPmEw
         gHbDfP/bQ4TceFtsRQvrGk93o6+Hw/pmoFsq38o92xuWB0deGzNKrwMtT9T2bV3Tx15q
         9uNaLCvzz/osCJ5f0RTPBJ6v+z5IQDAk/vnzL57wuNqqWl2Ngg89bKmY0cPUUMcyIaIQ
         gC9ssu7gECDjUNyudTMJvJq9P82h6r7pbN2mP5wTQTP25YliCNZkvao5yoEsLPvYaiSX
         090A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=hGp0WjsQ6+2XetZJcojt2bGV/K33EZG/AxxU2XagIpw=;
        b=Mm2S8+nFCXixt55J6wtTIPgq8qm2WpyQRoJeIiPf6n5ggVn/xCN6LYGZnN8XwRk9AW
         k6Zh462QOUY8ygch+Lqk8Yro3pnK1y3BDtPBZ79B9k1dyFnUQyq3W0odNoUiThW9bJdW
         CGX3JpRlvqgz/HydDEKR1noLOW46cld3TVK7LemuSZSROWKThPJKXIblEJfvdiaxLtRa
         6DND308A+zsOsXnYFWkzkC/upmlrNPDS9n6FUp5YGsaVaUPBVY8I1BF8+9uAmpyyP7IA
         ShEWLV1zzE1+/G7RBplQpLGzEBkoKec+Fmqzegs/cWpyZiuwexqxEth58g0tH6VTWkCS
         XlvQ==
X-Gm-Message-State: AOAM531F1gmra2j8cZxapdt/1xcz+R+GJSEMTuL/XadZJ3jLHaap+7/C
        RBNHXRBQNfSD4cLzDMCivB7J6X5y
X-Google-Smtp-Source: 
 ABdhPJwo/3gb6Ww5ejqS0pSxIO4DOhq2WXXmMauqMXaS8FVBtj6GTiRojVWpKBVHCe3ti7mBCdSUnA==
X-Received: by 2002:a05:6000:12c4:: with SMTP id
 l4mr8189471wrx.128.1590742299212;
        Fri, 29 May 2020 01:51:39 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.38
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:38 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 27/34] commit-graph: load modified path Bloom filters for
 merge commits
Date: Fri, 29 May 2020 10:50:31 +0200
Message-Id: <20200529085038.26008-28-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

---
 commit-graph.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-----
 commit-graph.h |  2 ++
 2 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 178bbf0113..32738ba4b8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -26,6 +26,7 @@
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX 0x4d504249 /* "MPBI" */
+#define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_MERGE_INDEX 0x4d50424d /* "MPBM" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS 0x4d504246 /* "MPBF" */
 #define GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_EXCLUDES 0x4d504258 /* "MPBX" */
 
@@ -328,6 +329,15 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			}
 			break;
 
+		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_MERGE_INDEX:
+			if (graph->chunk_mpbf_merge_index)
+				chunk_repeated = 1;
+			else {
+				graph->chunk_mpbf_merge_index = data + chunk_offset;
+				graph->chunk_mpbf_merge_index_size = next_chunk_offset - chunk_offset;
+			}
+			break;
+
 		case GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTERS:
 			if (graph->chunk_mpbf_filters)
 				chunk_repeated = 1;
@@ -869,11 +879,44 @@ static int load_modified_path_bloom_filter_from_graph(
 		bf->bits = (uint8_t*) bloom_index;
 		return 1;
 	} else if (bloom_index[0] & (1 << 6)) {
-		/*
-		 * Modified path Bloom filters for second..nth parents of
-		 * merge commits are not implemented yet.
-		 */
-		return 0;
+		struct commit_list *p;
+		uint32_t pos;
+		int found = 0;
+
+		pos = get_be32(bloom_index + sizeof(uint32_t));
+
+		if (!graph->chunk_mpbf_merge_index)
+			BUG("commit %s refers to position %u in the Modified Path Bloom Filter Merge Index chunk, but that chunk is missing",
+			    oid_to_hex(&commit->object.oid), pos);
+
+		for (p = commit->parents; p; p = p->next, pos++)
+			if (p->item == parent) {
+				found = 1;
+				break;
+			}
+		if (!found)
+			BUG("commit %s has no parent %s\n",
+			    oid_to_hex(&commit->object.oid),
+			    oid_to_hex(&parent->object.oid));
+
+		if (pos * sizeof(uint64_t) > graph->chunk_mpbf_merge_index_size)
+			BUG("commit %s and parent %s refer to position %u in the Modified Path Bloom Filter Merge Index chunk, but that's too large for a chunk of size %lu bytes",
+			    oid_to_hex(&commit->object.oid),
+			    oid_to_hex(&parent->object.oid), pos,
+			    graph->chunk_mpbf_merge_index_size);
+		bloom_index = graph->chunk_mpbf_merge_index + sizeof(uint64_t) * pos;
+		if (bloom_index[0] & (1 << 7)) {
+			uint64_t v;
+			memcpy(&v, bloom_index, sizeof(v));
+			if (v == GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE)
+				return 0;
+
+			/* embedded modified path Bloom filter */
+			bf->nr_bits = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS;
+			bf->bits = (uint8_t*) bloom_index;
+			return 1;
+		} else
+			offset = get_be64(bloom_index);
 	} else {
 		if (!first_parent)
 			return 0;
diff --git a/commit-graph.h b/commit-graph.h
index cde0d7fa30..b2605a0e3b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -69,6 +69,8 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_mpbf_index;
 	uint64_t chunk_mpbf_index_size;
+	const unsigned char *chunk_mpbf_merge_index;
+	uint64_t chunk_mpbf_merge_index_size;
 	const unsigned char *chunk_mpbf_filters;
 	uint64_t chunk_mpbf_filters_size;
 	const unsigned char *chunk_mpbf_excludes;

From patchwork Fri May 29 08:50:32 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578153
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EB3A1739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:47 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id C8901207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:47 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="UGR43QJE"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726999AbgE2Ivr (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:47 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44458 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1725971AbgE2Ivn (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:43 -0400
Received: from mail-wm1-x341.google.com (mail-wm1-x341.google.com
 [IPv6:2a00:1450:4864:20::341])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 02619C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:43 -0700 (PDT)
Received: by mail-wm1-x341.google.com with SMTP id r15so2416422wmh.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=J6YxhcOeTLGvQ4A3/oRLd1a/aKbqOjLybu1G5SKuJ4o=;
        b=UGR43QJEi+ByaewTQP/VyOyNLwA8avRB4fu7yCBsoECeE8IdaswjhNAeGVU9u5gnGz
         Mg0KCDfTBjGDSqglS1t706M0hpxTsLAv+vZ7v3yBrm/FVrU7XkHGPCcMd1cwvBrA/Rf4
         nJoQG4cv2WXHd0nowc5iHlUgVALBG+xdv2izI4MqI4EvRlH45S4fKIUtfKorzBvqJV1G
         Nj6ugh/Kez5FFP81cxURFDukmeXAHFyTNz8KF/oZwI+xITePAb4DJSXpyMUvI095KLYy
         K5CqXy6MjIrlM/rLK2fKh7DZegyeVe3Pvg6bVAvMSJnIBf9J4o7r526u41IWJNzyuCQf
         FPHw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=J6YxhcOeTLGvQ4A3/oRLd1a/aKbqOjLybu1G5SKuJ4o=;
        b=g65O5HLGyRu3LgESpiYLJqR3EoKfC8HPmJAf4r4XJrANDKgAdNWIZoWDDC1bA/cm52
         Y2eOzJkBqdSAlywy/RF9tkec3OoD5WCi6mEphUzykyHwOdj3nkBL6CsSJL4/04MV747J
         N/4bJ78YKq218Ho1JRu58E6/P6wQWSnpHvn0ybcwoDc/aMd1LiB+dSdwlrYwSf3DGeFo
         nVTZzc7bw4m+OKreFv4WFGN6rmbm3s74F63qguPq8FvRcS4Jqn9G53h+6RoGJ6QCoOsH
         qNqWH7v13HhVDNe0Y6egYqdKJ+0ga4Bi4elVRWYmpZkalr00t5GFSn+4aBgqFZUzs8Lj
         nI7g==
X-Gm-Message-State: AOAM531fIMymIVp8mgeMYJTJuIhP934VKTwEODZnMWFRj2BW4n+EmCfZ
        eUrUNY+frRzcqbDTw5rxgFuTkex7
X-Google-Smtp-Source: 
 ABdhPJzySL6vI4u3NmH7Oshm5r8CZKfYPF9MoA8gyenBBcXHSoBscvvgTmIpmyQS8MOq0ZWoNHMddg==
X-Received: by 2002:a1c:6a0d:: with SMTP id f13mr7690219wmc.180.1590742300970;
        Fri, 29 May 2020 01:51:40 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.39
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:40 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge
 Index chunk
Date: Fri, 29 May 2020 10:50:32 +0200
Message-Id: <20200529085038.26008-29-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

[Explain!]

The table below compares the runtime of

  git rev-list --full-history HEAD -- "$path"

with modified path Bloom filters for only first parents and with all
parents of merge commits for 5000+ randomly selected paths from each
repository.  It also shows the size increase of the commit-graph file:

                          |    Average runtime            |   commit-graph file size
                          |                               |
               Percentage |    first      all             |   first     all
                of merge  |   parent     merge    Average |  parent    merge     Size
                 commits  |    only     parents   speedup |   only    parents  increase
  ------------------------+-------------------------------+----------------------------
  android-base    73.6%   |  7.0890s    0.9681s    7.32x  |  29.8MB   218.0MB   7.32x
  cmssw           11.0%   |  0.7164s    0.2577s    2.78x  |  22.9MB    51.8MB   2.26x
  cpython         11.7%   |  0.3815s    0.1460s    2.61x  |   7.8MB    10.6MB   1.36x
  elasticsearch    8.4%   |  0.1217s    0.0582s    2.09x  |   9.6MB    10.9MB   1.14x
  gecko-dev        3.5%   |  1.6432s    0.9689s    1.70x  |  63.5MB    90.4MB   1.42x
  git             25.3%   |  0.8743s    0.1729s    5.06x  |   3.8MB    10.4MB   2.74x
  homebrew-cask    9.6%   |  1.1454s    0.1846s    6.20x  |   6.5MB     7.2MB   1.10x
  jdk             25.0%   |  0.2598s    0.1242s    2.09x  |  15.1MB    20.3MB   1.34x
  linux            7.4%   |  2.8037s    1.3339s    2.10x  |  78.1MB   269.8MB   3.45x
  rails           22.2%   |  0.2854s    0.0926s    3.08x  |   6.0MB     8.4MB   1.40x
  rust            27.0%   |  1.9803s    0.1794s   11.04x  |   8.2MB    22.0MB   2.68x
  tensorflow       9.1%   |  0.3886s    0.1237s    3.14x  |   9.0MB    19.0MB   2.11x

Storing modified path Bloom filters for all parents of merge commits
significantly increases the size of the Modified Path Bloom Filters
chunk, though when looking at the size of the whole commit-graph file
the relative increase is sometimes not that much.  FWIW, the average
speedup in most cases (except android-base and linux) is higher than
the size increase of the commit-graph file.

Beware!  To create a Bloom filter from the paths modified between a
commit and one of its parents, it seems natural to extend
create_modified_path_bloom_filter() with a 'struct commit *parent'
parameter...  and then things would blow up when encountering commit
a733a5da97b2 (Merge branches 'release' and 'fluff' into release,
2008-02-07) in linux.git:

  $ git -C linux.git log -1 a733a5da97b2 |head -n2
  commit a733a5da97b238e3e3167d3d0aee8fe1e8d04e97
  Merge: 299cfe38081b 299cfe38081b 9e52797131e8

Notice how the same commit is both the first _and_ second parent!  So
extend create_modified_path_bloom_filter()'s signature with the int
index of the particular parent.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 Documentation/config/core.txt |  13 +++
 commit-graph.c                | 169 +++++++++++++++++++++++++++++-----
 2 files changed, 158 insertions(+), 24 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 820f7d3bcf..b4311c5f31 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -594,6 +594,19 @@ core.modifiedPathBloomFilters::
 	commit-graph file to speed up pathspec-limited revision walks.
 	Defaults to false.
 
+core.modifiedPathBloomFiltersForAllMergeParents::
+	EXPERIMENTAL!!!
+	If true, then Git will store Bloom filters for all parents of
+	merge commits in the commit-graph file.
+	This has no impact on repositories with linear history.  In
+	repositories with lots of merge commits this considerably
+	increases the runtime of writing the commit-graph file and its
+	size.  It has little benefit in pathspec-limited revision walks
+	with the default history simplification, but can speed up
+	`--full-history` considerably.
+	Defaults to false, ignored if `core.modifiedPathBloomFilters` is
+	false.
+
 core.useReplaceRefs::
 	If set to `false`, behave as if the `--no-replace-objects`
 	option was given on the command line. See linkgit:git[1] and
diff --git a/commit-graph.c b/commit-graph.c
index 32738ba4b8..9cce79d452 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -65,12 +65,23 @@ struct modified_path_bloom_filter_info {
 	 * -1 before that.
 	 */
 	uint64_t offset;
+	uint32_t merge_index_pos;
+	struct modified_path_bloom_filter_info *next;
 };
 
 static void free_modified_path_bloom_filter_info_in_slab(
 		struct modified_path_bloom_filter_info *bfi)
 {
+	/* The instance got as parameter is on the slab, don't free() it. */
 	bloom_filter_free(&bfi->filter);
+	bfi = bfi->next;
+	/* The rest is in a linked list, and must be free()d. */
+	while (bfi) {
+		struct modified_path_bloom_filter_info *prev = bfi;
+		bloom_filter_free(&bfi->filter);
+		bfi = bfi->next;
+		free(prev);
+	}
 }
 
 define_commit_slab(modified_path_bloom_filters,
@@ -1115,7 +1126,8 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 
 	struct modified_path_bloom_filter_context {
-		unsigned use_modified_path_bloom_filters:1;
+		unsigned use_modified_path_bloom_filters:1,
+			 all_merge_parents:1;
 		unsigned int num_hashes;
 		/*
 		 * Number of paths to be added to "embedded" modified path
@@ -1143,6 +1155,7 @@ struct write_commit_graph_context {
 
 		/* Excluding embedded modified path Bloom filters */
 		uint64_t total_filter_size;
+		uint32_t num_merge_index_entries;
 
 		/* Used to find identical modified path Bloom filters */
 		struct hashmap dedup_hashmap;
@@ -1361,28 +1374,70 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
+		for (; bfi; bfi = bfi->next) {
+			if (bfi->duplicate_of)
+				continue;
+			if (!bfi->filter.nr_bits)
+				continue;
+			if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS)
+				continue;
 
-		if (!bfi)
-			continue;
-		if (bfi->duplicate_of)
-			continue;
-		if (!bfi->filter.nr_bits)
-			continue;
-		if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS)
-			continue;
+			if (offset >> 62)
+				BUG("offset %lu is too large for the Modified Path Bloom Filter Index chunk",
+				    offset);
+
+			bfi->offset = offset;
+
+			filter_size = bloom_filter_bytes(&bfi->filter);
+
+			hashwrite_be32(f, bfi->filter.nr_bits);
+			hashwrite(f, bfi->filter.bits, filter_size);
+
+			offset += sizeof(uint32_t) + filter_size;
+		}
+	}
+	return 0;
+}
 
-		if (offset >> 62)
-			BUG("offset %lu is too large for the Modified Path Bloom Filter Index chunk",
-			    offset);
+static int write_graph_chunk_modified_path_bloom_merge_index(
+		struct hashfile *f, struct write_commit_graph_context *ctx)
+{
+	const uint64_t no_bloom_filter = GRAPH_MODIFIED_PATH_BLOOM_FILTER_NONE;
+	uint32_t pos = 0;
+	int i;
 
-		bfi->offset = offset;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *commit = ctx->commits.list[i];
+		struct modified_path_bloom_filter_info *bfi;
 
-		filter_size = bloom_filter_bytes(&bfi->filter);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		hashwrite_be32(f, bfi->filter.nr_bits);
-		hashwrite(f, bfi->filter.bits, filter_size);
+		bfi = modified_path_bloom_filters_peek(
+				&modified_path_bloom_filters, commit);
 
-		offset += sizeof(uint32_t) + filter_size;
+		if (!bfi || !bfi->next)
+			/* No merge index entries for this commit. */
+			continue;
+
+		do {
+			if (bfi->duplicate_of) {
+				uint64_t offset = htonll(bfi->duplicate_of->offset);
+				hashwrite(f, &offset, sizeof(offset));
+			} else if (!bfi->filter.nr_bits) {
+				hashwrite(f, &no_bloom_filter,
+					  sizeof(no_bloom_filter));
+			} else if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
+				uint8_t filterdata[sizeof(uint64_t)];
+				memcpy(filterdata, bfi->filter.bits, sizeof(filterdata));
+				filterdata[0] |= 1 << 7;
+				hashwrite(f, filterdata, sizeof(filterdata));
+			} else if (bfi->offset != -1) {
+				uint64_t offset = htonll(bfi->offset);
+				hashwrite(f, &offset, sizeof(offset));
+			} else
+				BUG("modified path Bloom filter offset is still -1?!");
+			bfi->merge_index_pos = pos++;
+		} while ((bfi = bfi->next));
 	}
 	return 0;
 }
@@ -1403,7 +1458,19 @@ static int write_graph_chunk_modified_path_bloom_index(struct hashfile *f,
 		bfi = modified_path_bloom_filters_peek(
 				&modified_path_bloom_filters, commit);
 
-		if (!bfi || !bfi->filter.nr_bits) {
+		if (!bfi) {
+			hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
+		} else if (bfi->next) {
+			uint8_t filterdata[sizeof(uint64_t)];
+			if (!commit->parents->next)
+				BUG("uh-oh #1");
+			if (bfi->merge_index_pos == -1)
+				BUG("uh-oh #2");
+			memset(filterdata, 0, sizeof(filterdata));
+			filterdata[0] |= 1 << 6;
+			((uint32_t*)filterdata)[1] = htonl(bfi->merge_index_pos);
+			hashwrite(f, filterdata, sizeof(filterdata));
+		} else if (!bfi->filter.nr_bits) {
 			hashwrite(f, &no_bloom_filter, sizeof(no_bloom_filter));
 		} else if (bfi->filter.nr_bits == GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS) {
 			uint8_t filterdata[sizeof(uint64_t)];
@@ -1618,14 +1685,18 @@ static int handle_duplicate_modified_path_bloom_filter(
 
 static void create_modified_path_bloom_filter(
 		struct modified_path_bloom_filter_context *mpbfctx,
-		struct commit *commit)
+		struct commit *commit, int nth_parent)
 {
 	struct modified_path_bloom_filter_info *bfi;
 	struct object_id *parent_oid;
 	uintmax_t path_component_count;
+	int i;
 
 	if (!mpbfctx->use_modified_path_bloom_filters)
 		return;
+	if (nth_parent &&
+	    !mpbfctx->all_merge_parents)
+		return;
 
 	/*
 	 * This function can be called multiple times for the same commit
@@ -1635,11 +1706,25 @@ static void create_modified_path_bloom_filter(
 	 */
 	bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters,
 					       commit);
-	if (bfi && bfi->filter.nr_bits)
+	for (i = 0; i < nth_parent && bfi; i++, bfi = bfi->next);
+	if (i == nth_parent && bfi && bfi->filter.nr_bits)
 		return;
 
-	parent_oid = commit->parents ? &commit->parents->item->object.oid :
-				       NULL;
+	if (commit->parents) {
+		struct commit_list *p;
+		for (i = 0, p = commit->parents;
+		     i < nth_parent && p;
+		     i++, p = p->next);
+		if (!p)
+			BUG("couldn't find %dth parent of commit %s\n",
+			    nth_parent + 1, oid_to_hex(&commit->object.oid));
+		parent_oid = &p->item->object.oid;
+	} else {
+		if (nth_parent)
+			BUG("looking for the %dth parent of commit %s with no parents",
+			    nth_parent + 1, oid_to_hex(&commit->object.oid));
+		parent_oid = NULL;
+	}
 	mpbfctx->hashes_nr = 0;
 	strbuf_reset(&mpbfctx->prev_path);
 	diff_tree_oid(parent_oid, &commit->object.oid, "", &mpbfctx->diffopt);
@@ -1647,7 +1732,23 @@ static void create_modified_path_bloom_filter(
 
 	bfi = modified_path_bloom_filters_at(&modified_path_bloom_filters,
 					     commit);
+	if (!nth_parent) {
+		/* This is the right bloom_filter_info instance. */
+		if (mpbfctx->all_merge_parents &&
+		    commit->parents && commit->parents->next)
+			mpbfctx->num_merge_index_entries++;
+	} else {
+		/* i is one step ahead of bfi */
+		for (i = 1; i < nth_parent && bfi; i++, bfi = bfi->next);
+		if (bfi->next)
+			BUG("Huh?!?");
+		bfi->next = xcalloc(1, sizeof(*bfi));
+		bfi = bfi->next;
+		mpbfctx->num_merge_index_entries++;
+	}
+
 	bfi->offset = -1;
+	bfi->merge_index_pos = -1;
 	if (path_component_count > mpbfctx->embedded_limit)
 		bloom_filter_init(&bfi->filter, mpbfctx->num_hashes,
 				  path_component_count);
@@ -1667,10 +1768,20 @@ static void create_modified_path_bloom_filter(
 static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
 {
 	struct commit_list *parent;
+	int nth_parent;
+
+	if (!commit->parents) {
+		/* root commit */
+		create_modified_path_bloom_filter(&ctx->mpbfctx, commit, 0);
+		return;
+	}
 
-	create_modified_path_bloom_filter(&ctx->mpbfctx, commit);
+	for (parent = commit->parents, nth_parent = 0;
+	     parent;
+	     parent = parent->next, nth_parent++) {
+		create_modified_path_bloom_filter(&ctx->mpbfctx, commit,
+						  nth_parent);
 
-	for (parent = commit->parents; parent; parent = parent->next) {
 		if (!(parent->item->object.flags & REACHABLE)) {
 			ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
 			oidcpy(&ctx->oids.list[ctx->oids.nr], &(parent->item->object.oid));
@@ -2079,6 +2190,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_filters;
 			chunks_nr++;
 		}
+		if (ctx->mpbfctx.num_merge_index_entries) {
+			ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
+			chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_MERGE_INDEX;
+			chunks[chunks_nr].size = ctx->mpbfctx.num_merge_index_entries * sizeof(uint64_t);
+			chunks[chunks_nr].write_fn = write_graph_chunk_modified_path_bloom_merge_index;
+			chunks_nr++;
+		}
 		ALLOC_GROW(chunks, chunks_nr + 1, chunks_alloc);
 		chunks[chunks_nr].id = GRAPH_CHUNKID_MODIFIED_PATH_BLOOM_FILTER_INDEX;
 		chunks[chunks_nr].size = sizeof(uint8_t) +
@@ -2487,6 +2605,9 @@ int write_commit_graph(const char *obj_dir,
 		warning("not writing modified path Bloom filters with --split");
 	} else if (res) {
 		ctx->mpbfctx.use_modified_path_bloom_filters = 1;
+		if (!git_config_get_bool("core.modifiedPathBloomFiltersForAllMergeParents", &res) &&
+		    res)
+			ctx->mpbfctx.all_merge_parents = 1;
 		ctx->mpbfctx.num_hashes = GRAPH_MODIFIED_PATH_BLOOM_FILTER_DEFAULT_NR_HASHES;
 		ctx->mpbfctx.embedded_limit = GRAPH_MODIFIED_PATH_BLOOM_FILTER_EMBEDDED_NR_BITS / (ctx->mpbfctx.num_hashes * 10 / 7);
 

From patchwork Fri May 29 08:50:33 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578157
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 27484912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:51 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 075A920776
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:51 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="vX/hJaID"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727009AbgE2Ivt (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:49 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44460 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726923AbgE2Ivo (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:44 -0400
Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com
 [IPv6:2a00:1450:4864:20::444])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 28947C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:44 -0700 (PDT)
Received: by mail-wr1-x444.google.com with SMTP id y17so2504012wrn.11
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=RpWiAPkGbMrBYhRMZNhm1LsT741LFFYvJlM/7/Welho=;
        b=vX/hJaIDd9DG+lfuslti65P2YhCY3SY/SCKMtO57T2Rtoa31TyXrlRht49xi/9RoE/
         fqDkT9ygNIBJksenlFaeyLFHC+ShS6UpY0NuyRbDuG4ld8CduiVrjbb8CandU8snUlJA
         9BWVRl6OC2YJtWs4Nd5VXDJ1tDnQTkDyntBnBFbv+Z+rrACnHZ07ITlbDNdY3DKP+SzJ
         I/qRzIll6AnHiLD1zmkM/MZHG+tRXWifcMTFfHSjpOcrrl3U2e+ouDGh1GI1ZYORJsRn
         IhXNDuwqy2GBqHvfZ4Xl0vne/rWTOrsEudwVy2tIwRlTsLHsUiXH0bnetsnpP4ImBdey
         NgrQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=RpWiAPkGbMrBYhRMZNhm1LsT741LFFYvJlM/7/Welho=;
        b=BxDf53OcrJ3dJ1lEGA5Ydfw7C8SHlOMleyWkeLnupguAs95gwRHNbjgzFX0xRXaiwm
         yF/FF24e8NGQuQPygKY7qwwRX/aGQFCr4QQCay0CuZjPei+G+hDvfKAFtPYM7jhEZRSF
         PMacZk7xz2EShQs+tRpFteTYuC/jgUaWAe1JiG9ude13CPl/C+gsovxCtjrNvbWVc146
         2kNSFGqcfhN9xSMhLXzSQb99MeknCHW+1KmweTD807oi9UeIEKr/T1CU93vkkSVLF3Lu
         zkgzTUzVwxVbtlsrrbs8arY49YWCi+o4uwvJMzSiXuip1JIMKDymw22ClzIrSaUVFQ9Y
         lWTA==
X-Gm-Message-State: AOAM5303ZtWofCsyMc3JjXySJOGxy7OFEhWm+9r6JsYhGvGTJHD38aLx
        M/byJpXupyhgJjT0NFWw+d+X9dOj
X-Google-Smtp-Source: 
 ABdhPJx4Up8r7+Jdrw5Y8J9amTggxoroapi2PA2Ukeimg0w2td3QHcdxIfvDSKzTQKX8az2nzxqFDg==
X-Received: by 2002:adf:dec5:: with SMTP id i5mr3892939wrn.16.1590742302464;
        Fri, 29 May 2020 01:51:42 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.41
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:41 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 29/34] commit-graph: extract init and free
 write_commit_graph_context
Date: Fri, 29 May 2020 10:50:33 +0200
Message-Id: <20200529085038.26008-30-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

The write_commit_graph() function is just over 180 lines long, among
which 86 lines initialize the 'struct write_commit_graph_context'
instance and 32 lines free its resources.

Extract these into init_write_commit_graph_context() and
free_write_commit_graph_context() functions, respectively, to make
write_commit_graph() look much simpler, and, more importantly, to be
able to reuse those init and free functions in the next patch.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 102 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 62 insertions(+), 40 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 9cce79d452..b5cd4c863a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2570,19 +2570,16 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	strbuf_release(&path);
 }
 
-int write_commit_graph(const char *obj_dir,
-		       struct string_list *pack_indexes,
-		       struct string_list *commit_hex,
-		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+static struct write_commit_graph_context *init_write_commit_graph_context(
+		const char *obj_dir, enum commit_graph_write_flags flags,
+		const struct split_commit_graph_opts *split_opts)
 {
 	struct write_commit_graph_context *ctx;
-	uint32_t i, count_distinct = 0;
 	size_t len;
 	int res = 0;
 
 	if (!commit_graph_compatible(the_repository))
-		return 0;
+		return NULL;
 
 	ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
 	ctx->r = the_repository;
@@ -2637,6 +2634,7 @@ int write_commit_graph(const char *obj_dir,
 		}
 
 		if (ctx->num_commit_graphs_before) {
+			int i;
 			ALLOC_ARRAY(ctx->commit_graph_filenames_before, ctx->num_commit_graphs_before);
 			i = ctx->num_commit_graphs_before;
 			g = ctx->r->objects->commit_graph;
@@ -2664,8 +2662,64 @@ int write_commit_graph(const char *obj_dir,
 		ctx->oids.alloc = 1024;
 	ALLOC_ARRAY(ctx->oids.list, ctx->oids.alloc);
 
+	return ctx;
+}
+
+static void free_write_commit_graph_context(
+		struct write_commit_graph_context *ctx)
+{
+	free(ctx->graph_name);
+	free(ctx->commits.list);
+	free(ctx->oids.list);
+	free(ctx->obj_dir);
+
+	if (ctx->commit_graph_filenames_after) {
+		int i;
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			free(ctx->commit_graph_filenames_after[i]);
+			free(ctx->commit_graph_hash_after[i]);
+		}
+
+		for (i = 0; i < ctx->num_commit_graphs_before; i++)
+			free(ctx->commit_graph_filenames_before[i]);
+
+		free(ctx->commit_graph_filenames_after);
+		free(ctx->commit_graph_filenames_before);
+		free(ctx->commit_graph_hash_after);
+	}
+
+	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
+		deep_clear_modified_path_bloom_filters(
+				&modified_path_bloom_filters,
+				free_modified_path_bloom_filter_info_in_slab);
+		strbuf_release(&ctx->mpbfctx.prev_path);
+		free(ctx->mpbfctx.hashed_prefix_lengths);
+		free(ctx->mpbfctx.hashes);
+		hashmap_free_entries(&ctx->mpbfctx.dedup_hashmap,
+				struct modified_path_bloom_filter_dedup_entry,
+				entry);
+	}
+
+	free(ctx);
+}
+
+int write_commit_graph(const char *obj_dir,
+		       struct string_list *pack_indexes,
+		       struct string_list *commit_hex,
+		       enum commit_graph_write_flags flags,
+		       const struct split_commit_graph_opts *split_opts)
+{
+	struct write_commit_graph_context *ctx;
+	uint32_t count_distinct = 0;
+	int res = 0;
+
+	ctx = init_write_commit_graph_context(obj_dir, flags, split_opts);
+	if (!ctx)
+		return 0;
+
 	if (ctx->append && ctx->r->objects->commit_graph) {
 		struct commit_graph *g = ctx->r->objects->commit_graph;
+		int i;
 		for (i = 0; i < g->num_commits; i++) {
 			const unsigned char *hash = g->chunk_oid_lookup + g->hash_len * i;
 			hashcpy(ctx->oids.list[ctx->oids.nr++].hash, hash);
@@ -2726,39 +2780,7 @@ int write_commit_graph(const char *obj_dir,
 	expire_commit_graphs(ctx);
 
 cleanup:
-	free(ctx->graph_name);
-	free(ctx->commits.list);
-	free(ctx->oids.list);
-	free(ctx->obj_dir);
-
-	if (ctx->commit_graph_filenames_after) {
-		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
-			free(ctx->commit_graph_filenames_after[i]);
-			free(ctx->commit_graph_hash_after[i]);
-		}
-
-		for (i = 0; i < ctx->num_commit_graphs_before; i++)
-			free(ctx->commit_graph_filenames_before[i]);
-
-		free(ctx->commit_graph_filenames_after);
-		free(ctx->commit_graph_filenames_before);
-		free(ctx->commit_graph_hash_after);
-	}
-
-	if (ctx->mpbfctx.use_modified_path_bloom_filters) {
-		deep_clear_modified_path_bloom_filters(
-				&modified_path_bloom_filters,
-				free_modified_path_bloom_filter_info_in_slab);
-		strbuf_release(&ctx->mpbfctx.prev_path);
-		free(ctx->mpbfctx.hashed_prefix_lengths);
-		free(ctx->mpbfctx.hashes);
-		hashmap_free_entries(&ctx->mpbfctx.dedup_hashmap,
-				struct modified_path_bloom_filter_dedup_entry,
-				entry);
-	}
-
-	free(ctx);
-
+	free_write_commit_graph_context(ctx);
 	return res;
 }
 

From patchwork Fri May 29 08:50:34 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578155
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4DBA4912
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:49 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 3302720723
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:49 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="HT7Nul9F"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727007AbgE2Ivs (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:48 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44476 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726961AbgE2Ivq (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:46 -0400
Received: from mail-wm1-x32d.google.com (mail-wm1-x32d.google.com
 [IPv6:2a00:1450:4864:20::32d])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 93BE0C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:45 -0700 (PDT)
Received: by mail-wm1-x32d.google.com with SMTP id n5so2470618wmd.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:45 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=DAartjGCZZ1D2UpyUna/mR81dwpkMcQdphefIU4DvJU=;
        b=HT7Nul9FfiiyKDB5O5Fd9/FRAIDCZoffzI8FSZ3H93zE7lhKei27gjVyLSlvJY/c8j
         Bu37H52377HCUcelgA761A53/P55ZYMZszVUZq9CsgAEJ3vHEMz7iYcBH9ldX0oV9Uam
         ChKIK3aLW9BrV68Sl6wcFNLt08krhC2x+ruX81tzIS4WLpQBYFc0zMlkC+t2ov8Bnweo
         tWMR4hTuKd0d0vB2lafJWl6AnFqD/TAMAUoQw8oahWZhqcvrZSiu+0kNWeqZxl7DZjf7
         ecdaCbN5zuJhyztzi/GbKdL0TKJBQsoa7brCxNjPcy5yxGedGN2c1C+wFnktGkkyb9H3
         kG8g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=DAartjGCZZ1D2UpyUna/mR81dwpkMcQdphefIU4DvJU=;
        b=R5cT/byL7Fj5Gs6WK84uPcz1ukcz9u/vwRkOFP7W6CDyHFX6jeF6EL2pl8eVKPgDz3
         gyXLDs1QVofdAFIBTtIPOp1XmJVbp4DeFr3s8i//PElW1SNoQzwAD0gK2k3VzlhYcJAH
         GrmYwEz4SYijCW5a6AzkqobYoUetYe6Wgg99bnHLt71dA3+Bw7ofX4uB6YVMUVfw2UjW
         V3XWxxJJhQhtFRG1rKmHpI6j5v52SmVMKVZUfxLac0+RnUTLLpcWo8VapcVyH5B3N78B
         rvms0J1VWfpyzNSbk5unafno4sJFVUcvE1/imSAZezxMCMmpWRiK506rRKszeNxML0T3
         yd/A==
X-Gm-Message-State: AOAM530zIeArul8MFY9r8jvG0hB5wUuZdNHwpvBQm9ImGsVTc9moiLSp
        pTpyWw81dhX0B5p9xAPSlMv/M6vE
X-Google-Smtp-Source: 
 ABdhPJzLcYO0tXT9bvSYQ6o98DHWBoJYXL/9GGlsi9cw3j/eEl2u6HXuFquGKmV1T+Jy7kR32jDsOQ==
X-Received: by 2002:a1c:5606:: with SMTP id k6mr7909237wmb.10.1590742304049;
        Fri, 29 May 2020 01:51:44 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.42
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:43 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 30/34] commit-graph: move write_commit_graph_reachable below
 write_commit_graph
Date: Fri, 29 May 2020 10:50:34 +0200
Message-Id: <20200529085038.26008-31-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

---
 commit-graph.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b5cd4c863a..4cbfce61bf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1901,21 +1901,6 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
-int write_commit_graph_reachable(const char *obj_dir,
-				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
-{
-	struct string_list list = STRING_LIST_INIT_DUP;
-	int result;
-
-	for_each_ref(add_ref_to_list, &list);
-	result = write_commit_graph(obj_dir, NULL, &list,
-				    flags, split_opts);
-
-	string_list_clear(&list, 0);
-	return result;
-}
-
 static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
 				struct string_list *pack_indexes)
 {
@@ -2784,6 +2769,21 @@ int write_commit_graph(const char *obj_dir,
 	return res;
 }
 
+int write_commit_graph_reachable(const char *obj_dir,
+				 enum commit_graph_write_flags flags,
+				 const struct split_commit_graph_opts *split_opts)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	int result;
+
+	for_each_ref(add_ref_to_list, &list);
+	result = write_commit_graph(obj_dir, NULL, &list,
+				    flags, split_opts);
+
+	string_list_clear(&list, 0);
+	return result;
+}
+
 #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
 static int verify_commit_graph_error;
 

From patchwork Fri May 29 08:50:35 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578159
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EC861739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:51 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id D3CBF207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:51:51 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="cWne7mYq"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727013AbgE2Ivt (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:49 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44480 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1725971AbgE2Ivr (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:47 -0400
Received: from mail-wm1-x343.google.com (mail-wm1-x343.google.com
 [IPv6:2a00:1450:4864:20::343])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E7606C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:46 -0700 (PDT)
Received: by mail-wm1-x343.google.com with SMTP id c71so2287550wmd.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=DmlILvQSwYjkFmMhHfaHxPRbDZnbba2maoT6vuIuUjs=;
        b=cWne7mYqImNpUUZAfTKlCl3GDK9thIuDzMAOG9h8ouEeRsSKqrxR8DBkYEEH6NiZE9
         rsfHfWbweC9c1AjU4oQK8Ttxz0SVOvefKly3tJDhmSqUDG5gwwTf2ag4aGdgVYv0aI22
         69ypkSBe32jVsYWUxYy6DC6HZ8hxSjwIREGYxoEgUCWP1PqIxUnQTfQjEColmWaenEkc
         v33AIvQ7I0Oft3uzXqTFDgqBPZpPYqgn19k2BOVxos1pmvxFY3SJ7RiD+41BSec9BW09
         dyRRztnL2iMf4lGrNZmFSupz2rUvq2OGiF2CKUrJwUs6FlW8vc51H404aY26yZPNScvE
         FUyw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=DmlILvQSwYjkFmMhHfaHxPRbDZnbba2maoT6vuIuUjs=;
        b=qO64fqasNF9dduCxtAwiy8VuMLsMoS2uJM+CQ9ax4c85WUoHU0C+CC2aTPgUM6WDVE
         ZDSCK1RLtX2k45JpbjGiDZQZPOijX4wLbgZIYlz46EaJl3+VAtfCTEThY0sUDoQYMyCa
         OmOHg/E9Dza8xlFPBYco6KbPrtd1ydQCOCpXdQ3BT8GSfss3sNl60U/wY0oxs7/YWlx2
         ev+YHKyPJ9LRq9IGlVa2zoBfi+Rgh/0OJYd3IDzMm2GvymMUXnR+G/njhcRMldwGY1GR
         31FKgkTXJhU2xahnZPfeLsHzsjw9vrledP2NpeQCgu2xWacKZXsfS/WAsuHgv5j8gLH1
         cGaQ==
X-Gm-Message-State: AOAM5303YXNx5pGGuoCtP5PGNuuDRuF7/r6iGLcxv0nLtTqZz4m2qFoU
        h9eUh2xau9ah25h8Q+CRTu138BAK
X-Google-Smtp-Source: 
 ABdhPJxO3pfWFsjctNdyZ3rSti45AUskbs19/Uob1ad7TDzgVYDj7YaYbqXqBkdX4Hkmlets/10Z6w==
X-Received: by 2002:a1c:df57:: with SMTP id w84mr7987773wmg.52.1590742305442;
        Fri, 29 May 2020 01:51:45 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.44
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:44 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 31/34] t7007-show: make the first test compatible with the
 next patch
Date: Fri, 29 May 2020 10:50:35 +0200
Message-Id: <20200529085038.26008-32-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Otherwise it would fail with GIT_TEST_COMMIT_GRAPH=1.
---
 t/t7007-show.sh | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/t/t7007-show.sh b/t/t7007-show.sh
index 42d3db6246..93244cd718 100755
--- a/t/t7007-show.sh
+++ b/t/t7007-show.sh
@@ -4,16 +4,15 @@ test_description='git show'
 
 . ./test-lib.sh
 
-test_expect_success setup '
+test_expect_success 'showing a tag that points at a missing object' '
+	test_when_finished "git tag -d foo-tag" &&
 	echo hello world >foo &&
 	H=$(git hash-object -w foo) &&
 	git tag -a foo-tag -m "Tags $H" $H &&
 	HH=$(expr "$H" : "\(..\)") &&
 	H38=$(expr "$H" : "..\(.*\)") &&
-	rm -f .git/objects/$HH/$H38
-'
+	rm -f .git/objects/$HH/$H38 &&
 
-test_expect_success 'showing a tag that point at a missing object' '
 	test_must_fail git --no-pager show foo-tag
 '
 

From patchwork Fri May 29 08:50:36 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578175
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5A5CC739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:52:02 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 3F0FB207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:52:02 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="iIBLOAry"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727028AbgE2Iv5 (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:57 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44488 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726961AbgE2Ivs (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:48 -0400
Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com
 [IPv6:2a00:1450:4864:20::444])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 67402C03E969
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:48 -0700 (PDT)
Received: by mail-wr1-x444.google.com with SMTP id c3so2490226wru.12
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:48 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=ASAVWnZfh22QQeQWl73yInYANgZJd/78wSrozFID7lI=;
        b=iIBLOAryQszWkQnlry/7a7GBzfCsT+TpDRMmr3raWsYqw4PnvTXsiD1Ey6nbkcR+Lm
         gDBQe/9nXwZFp7pqn5W1VCAhLVzVsoR/94oEK6K4/+0ueB9NQeny1vH+htgG/1qTQn72
         70qDa1gxbzO4x4PhopWsbXEgPlrQyCWxPiZS3cAwbizCk8R2oBNnRM86jxvbOMFH3LU9
         w4gQnvSQzXW0akNIEcCjpF7QJ2qgXm4suJQ9zetqR4tO4xREJPl5LBShs5x7ZO1/ffe7
         ahs3+C6waba4Qr80GSB52woxba3l3z2XpNdmFuTmSgZFrZgujMAOgfji2yGHDsZ7qDkR
         DFNw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=ASAVWnZfh22QQeQWl73yInYANgZJd/78wSrozFID7lI=;
        b=L4sIPzfhnIDahMo+/incwkeLNdB/YgBp/4eXABGCPULE4C1JSEaEBY+cma8Y+lIwHr
         qFiwSiYkTTkJ4VQ4wqS4QH2aYrskbLtPPKO8r6GrGhwVHVo/e/VFsNx2XhkGlAg9GJDy
         oClGQCxen/+cLU51XXsMhCgdEiKooIp+gYK14zyTDqsC7yJLymcLwFErKXSh4CppWpwp
         FU9A2zQySXCvMUD4AInPv5khGNxi7sYARZmUvDOj/X/SnwjFWK9BzjudPjvhOgYJwpRD
         JEboiwB4+nX1tjCmhyA9kPP+nwxxYEX6cUgQQyuR+74aMOfOhExmlNSii16L3v0vx/Vb
         p5VA==
X-Gm-Message-State: AOAM531RJO7zBRhduqyW1L/33nwA3xs+6rEAx0eJF6L2kPs/eU7ipJu0
        GtL5QQUh76qBtBnODeraoj9nF/F/
X-Google-Smtp-Source: 
 ABdhPJwUDU6sPRRnRaRee/Mv4sWD0Ol0+dpcn8tnwNso/alxJW36e4euDzcuLyWJ2kd42yI0TTJoEw==
X-Received: by 2002:a05:6000:1042:: with SMTP id
 c2mr8213720wrx.351.1590742306583;
        Fri, 29 May 2020 01:51:46 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.45
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:45 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 32/34] PoC commit-graph: use revision walk machinery for
 '--reachable'
Date: Fri, 29 May 2020 10:50:36 +0200
Message-Id: <20200529085038.26008-33-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

[Not sure it is legal to do this.  I hope it's legal.  It would be
great if it were legal.  But I really doubt it's legal.  FWIW, at
least the test suite still passes, even with GIT_TEST_COMMIT_GRAPH=1.
Or we have a hole in test coverage...]

Our revision walk machinery knows a thing or two about how to traverse
the history in an order that is friendlier to the delta base cache.
The hand-rolled revision walk in commit-graph.c:close_reachable() is
unaware of any such tricks: it is just a simple loop iterating over an
array of commits, appending unseen parents to the end.  This didn't
really matter in the past, because we only accessed commit objects
while writing the commit-graph.  Now, however, we have modified path
Bloom filters, and to fill them we need access to tree objects as well
to gather all modified paths, which puts more strain on the caches,
and the order in which we traverse the commits becomes relevant.

So let's use the regular revision walk machinery (i.e. 'struct
rev_info', setup_revisions(), get_revision() and friends) for
'--reachable' to speed up writing a commit-graph with modified path
Bloom filters.

However, stick to the current algorithm when scanning pack files in
commits, because passing all packed commits to setup_revisions() might
do more harm than good (presumably, I didn't check; scanning all packs
for commits can be hopelessly inefficient anyway).  '--stdin-commits',
too, could benefit from using the revision walk machinery when it's
fed with branch tips from 'git for-each-ref', or fare worse when it's
fed all commits from 'git rev-list' (didn't check this, either)... so
let's just leave it using the current algorithm for now.

The table below shows the reduced runtime and max RSS of writing a
commit-graph file from scratch with '--reachable' and modified path
Bloom filters, i.e. that of:

  git -c core.modifiedPathBloomFilters=1 commit-graph write --reachable

                       Runtime                         Max RSS
                  before     after               before      after
  --------------------------------------------------------------------------
  android-base    57.722s   40.113s   -30.5%     711804k    697624k    -2.0%
  cmssw           27.977s   25.154s   -10.1%     980884k    786196k   -19.9%
  cpython         10.161s    8.956s   -11.9%     331736k    328704k    -0.9%
  gcc            114.980s   36.975s   -67.9%     561928k    484212k   -13.8%
  gecko-dev      132.557s   97.314s   -26.6%    2203544k   2161480k    -1.9%
  git              8.097s    5.215s   -35.6%     240564k    189652k   -21.2%
  glibc            4.693s    3.996s   -14.9%     215924k    207572k    -3.9%
  homebrew-core   56.661s   56.384s    -0.5%     469668k    475596k    +1.3%
  jdk             20.355s   18.936s    -7.0%     327244k    322316k    -1.5%
  linux          215.235s   98.991s   -54.0%    1549852k   1445028k    -6.8%
  llvm-project    48.659s   30.933s   -36.4%     783740k    736988k    -6.0%
  rails            5.804s    5.389s    -7.2%     284524k    284528k       0%
  rust            20.151s   12.921s   -35.9%     687840k    664104k    -3.5%
  webkit          31.226s   30.341s    -2.8%    2018396k   1902884k    -5.7%
---
 commit-graph.c | 110 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 94 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 4cbfce61bf..1677e4fb3e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1703,6 +1703,9 @@ static void create_modified_path_bloom_filter(
 	 * (a commit can be present in multiple pack files, or multiple
 	 * refs can point to the same commit) so check first whether we have
 	 * already created a modified path Bloom filter for it.
+	 *
+	 * TODO: with '--reachable' we can now be sure that we only look at
+	 * each commit only once, so this is unnecessary in that code path.
 	 */
 	bfi = modified_path_bloom_filters_peek(&modified_path_bloom_filters,
 					       commit);
@@ -1891,16 +1894,6 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-static int add_ref_to_list(const char *refname,
-			   const struct object_id *oid,
-			   int flags, void *cb_data)
-{
-	struct string_list *list = (struct string_list *)cb_data;
-
-	string_list_append(list, oid_to_hex(oid));
-	return 0;
-}
-
 static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
 				struct string_list *pack_indexes)
 {
@@ -2773,14 +2766,99 @@ int write_commit_graph_reachable(const char *obj_dir,
 				 enum commit_graph_write_flags flags,
 				 const struct split_commit_graph_opts *split_opts)
 {
-	struct string_list list = STRING_LIST_INIT_DUP;
-	int result;
+	struct write_commit_graph_context *ctx;
+	struct rev_info revs;
+	const char *revs_argv[] = { NULL, "--all", NULL };
+	struct commit *commit;
+	int result = 0;
+
+	ctx = init_write_commit_graph_context(obj_dir, flags, split_opts);
+	if (!ctx)
+		return 0;
+
+	/*
+	 * Some git commands write a commit-graph in-process at the end,
+	 * prevent their revision walk to interfere with ours.
+	 */
+	clear_object_flags(SEEN | ADDED | SHOWN | TOPO_WALK_EXPLORED |
+			   TOPO_WALK_INDEGREE | UNINTERESTING);
+
+	init_revisions(&revs, NULL);
+	setup_revisions(ARRAY_SIZE(revs_argv) - 1, revs_argv, &revs, NULL);
+	if (prepare_revision_walk(&revs)) {
+		error(_("revision walk setup failed"));
+		result = -1;
+		goto cleanup;
+	}
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+				_("Gathering commit info"),
+				0);
+	while ((commit = get_revision(&revs))) {
+		display_progress(ctx->progress, ctx->oids.nr + 1);
+		ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+		oidcpy(&ctx->oids.list[ctx->oids.nr], &commit->object.oid);
+		ctx->oids.nr++;
+
+		if (!commit->parents) {
+			create_modified_path_bloom_filter(&ctx->mpbfctx,
+							  commit, 0);
+		} else {
+			struct commit_list *parent;
+			int nth_parent;
+
+			for (parent = commit->parents, nth_parent = 0;
+			     parent;
+			     parent = parent->next, nth_parent++)
+				create_modified_path_bloom_filter(&ctx->mpbfctx,
+								  commit, nth_parent);
+		}
+	}
+	stop_progress(&ctx->progress);
+
+	if (ctx->oids.nr >= GRAPH_EDGE_LAST_MASK) {
+		error(_("the commit graph format cannot write %d commits"), ctx->oids.nr);
+		result = -1;
+		goto cleanup;
+	}
 
-	for_each_ref(add_ref_to_list, &list);
-	result = write_commit_graph(obj_dir, NULL, &list,
-				    flags, split_opts);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
+
+	/*
+	 * TODO: the revision walking loop above could fill ctx->commits
+	 * straight away, counting extra edges as well.
+	 */
+	ctx->commits.alloc = ctx->oids.nr;
+	ALLOC_ARRAY(ctx->commits.list, ctx->commits.alloc);
+
+	/*
+	 * Despite its unassuming name, this function removes duplicate
+	 * commits/oids and, worse, it does counts extra edges as well.
+	 */
+	copy_oids_to_commits(ctx);
+
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split) {
+		split_graph_merge_strategy(ctx);
 
-	string_list_clear(&list, 0);
+		merge_commit_graphs(ctx);
+	} else
+		ctx->num_commit_graphs_after = 1;
+
+	compute_generation_numbers(ctx);
+
+	result = write_commit_graph_file(ctx);
+
+	if (ctx->split)
+		mark_commit_graphs(ctx);
+
+	expire_commit_graphs(ctx);
+
+cleanup:
+	free_write_commit_graph_context(ctx);
 	return result;
 }
 

From patchwork Fri May 29 08:50:37 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578173
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A2530739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:52:01 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 84672207F5
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:52:01 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="H0qTUysw"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727040AbgE2IwA (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:52:00 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44492 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1727017AbgE2Ivu (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:50 -0400
Received: from mail-wm1-x341.google.com (mail-wm1-x341.google.com
 [IPv6:2a00:1450:4864:20::341])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A30F2C08C5C6
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:49 -0700 (PDT)
Received: by mail-wm1-x341.google.com with SMTP id r15so2416777wmh.5
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:49 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=uX6HbOpHQEtIxexZFdu0oXN+a6kSCCA/+9ji0J8xGcE=;
        b=H0qTUysw9qgAH8lB0jSeLqj7lsnc59qmKgNtge1Pd9q3OTKk7Rx0KFMsKUHetRmHHJ
         CEO7IrfkxQ4n2MA4xe8LOk06Uj21g1CbuUpAWSg6MrXbOeTd0mjVvLbVBR+I073DqH20
         khiCK/WpzHQ5DhRs6G1O41PxrsCHXMx3p8qs8TwU9fFwDM4tEZcSWb51F0BDi79ozEft
         XVO4ABvurO56VFbGha+s6+YyrCgFu1LfOzjEfTrwqWDUEpRCtUWlGTwPN0Ola8Xcv2OP
         MEkBlMBygc37fT5alZKwjl5vs4ULxIYMj3kyt8848X8MzDyIm4lJOK7ExOrvUFu9soNg
         AaLQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=uX6HbOpHQEtIxexZFdu0oXN+a6kSCCA/+9ji0J8xGcE=;
        b=rBfBuvr/y5seQK/itnxoX+i9CePAUCTyiFjEaA+tC+Dcfi4PrXmxoKq9U9CqUONNO3
         O9fmvxG8I/4NVSaf4wNyFud/AjgVWy/C8+SPCGU/vV1s06lDQF+ALyYIEMR+XhysPiUx
         PdXhaffmTVGwEUkysj+DKhVUnQE7bCLOvTyf+z9/nw7Kj7e5c/vDRYxiYzo4uI1dz/AT
         RIaUNjqVoM/GzSVuS4+DR8lswp/520mqHFaIRQUNOi4O1reHFbz542Z4W6f2MIn1cPGs
         8eeSAChnrtdgIMFevZ/tdKZRWTHLt6WhGqCQX9LGh4U9yCni503ESlVO7g4ihPCr2npS
         l/cg==
X-Gm-Message-State: AOAM531OEHTEzPUBTs3CEkckd9tt2me2YdC0ArmB7bYyoN3ZnejXcEPM
        L6QhfHwVONbOHXbhY6s/kibiDr1q
X-Google-Smtp-Source: 
 ABdhPJy9xQo/GNjKb8jk4LjfcXruohXfo3k8cF5KmmNHChNfc2BMbC5hvphjFTHRK6DAR+TzzReU4w==
X-Received: by 2002:a1c:e389:: with SMTP id a131mr7784267wmh.46.1590742307901;
        Fri, 29 May 2020 01:51:47 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.46
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:47 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 33/34] commit-graph: write modified path Bloom filters in
 "history order"
Date: Fri, 29 May 2020 10:50:37 +0200
Message-Id: <20200529085038.26008-34-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

[Explain!  Try topo order!]

I don't have recent benchmark results at hand yet, but as far as I can
recall some old results this reduces the time spent loading and
checking modified path Bloom filters by another ~10%.  Checking Bloom
filters is of course only a small part of the whole runtime of a
pathspec-limited revision walk, so the overall improvement is only
about 1-2%.
Oh, well, I expected more from this change... But perhaps it has
larger impact with less embedded modified path Bloom filters?

This is the last patch in this series that improves pathspec-limited
revision walk performance, so it's time to compare average runtime and
memory usage of

  git rev-list HEAD -- "$path"

for 5000+ randomly selected paths from each repository with and
without modified path Bloom filters:

                  Average runtime     Average  |  Average max RSS
                 without      with    speedup  |  without    with
  ---------------------------------------------+----------------------------
  android-base   0.8780s    0.1456s     6.03x  |  387MB    91.2MB    -76.5%
  cmssw          0.3143s    0.0313s    10.04x  |  181MB    37.4MB    -79.4%
  cpython        0.7453s    0.0810s     9.20x  |  159MB    43.8MB    -72.5%
  elasticsearch  0.1492s    0.0136s    10.95x  |  134MB    21.8MB    -83.7%
  gcc            7.1852s    0.2114s    34.00x  |  297MB    91.1MB    -69.3%
  gecko-dev      4.6113s    0.4815s     9.58x  |  832MB   175.2MB    -79.0%
  git            0.6180s    0.0310s    19.94x  |  131MB    31.5MB    -76.0%
  glibc          0.5618s    0.0282s    19.92x  |  135MB    25.0MB    -81.6%
  go             0.4913s    0.0403s    12.19x  |  130MB    24.7MB    -81.1%
  jdk            0.0482s    0.0068s     7.09x  |   52MB    27.1MB    -48.2%
  linux          0.7043s    0.0873s     8.08x  |  438MB   150.9MB    -65.5%
  llvm-project   2.6844s    0.4135s     6.49x  |  384MB   126.5MB    -67.1%
  rails          0.2784s    0.0391s     7.12x  |   88MB    28.8MB    -67.3%
  rust           0.7757s    0.0439s    17.67x  |  344MB    42.0MB    -87.8%
  tensorflow     0.6258s    0.0487s    12.85x  |  233MB    36.7MB    -84.2%
  webkit         1.9137s    0.2420s     7.91x  |  940MB    84.7MB    -91.0%

The astute reader may notice that in several cases the achieved
speedup is a bit higher than the calculated potential speedup shown in
the log message of the patch adding the modified path Bloom filter
specs, e.g. 34x vs. 29.7x in the gcc repository.  I suspect that by
eliminating much of the tree-diff overhead we load a lot less (tree)
objects, putting less strain on the caches, and in turn making other
parts of the process faster as well.

Alas, this is not without extra cost: writing the commit-graph file
with modified path Bloom filters takes a lot longer and the resulting
commit-graph file is considerably larger:

          Writing a commit-graph file from      |
             scratch with '--reachable'         | commit-graph file size
                 without       with             |   without      with
  ----------------------------------------------+--------------------------------
  android-base    6.230s     40.880s     6.56x  |  25077512    31278605    +24.7%
  cmssw           3.177s     25.691s     8.09x  |  13017516    23971497    +84.1%
  cpython         1.344s      8.951s     6.66x  |   6808068     8152146    +19.7%
  gcc             2.886s     36.917s    12.79x  |  12839660    18068286    +40.7%
  gecko-dev       9.331s     97.729s    10.47x  |  43335208    66568549    +53.6%
  git             0.750s      5.245s     6.99x  |   3388208     4011595    +18.3%
  glibc           0.565s      4.146s     7.34x  |   2832460     3936588    +38.9%
  homebrew-cask   1.083s     27.024s    24.95x  |   5928644     6859160    +15.7%
  homebrew-core   1.702s     55.478s    32.60x  |   9171324    10509040    +14.5%
  jdk             0.568s     19.418s    34.19x  |   3237508    15858410   +389.8%
  linux          13.525s    100.837s     7.46x  |  49800744    81939893    +64.5%
  llvm-project    4.275s     31.188s     7.30x  |  19309116    25336867    +31.2%
  rails           0.986s      5.607s     5.69x  |   4990252     6317541    +26.5%
  rust            1.260s     13.250s    10.52x  |   6053824     8601440    +42.0%
  webkit          3.658s     30.469s     8.33x  |  12288620    19438512    +58.1%
---
 commit-graph.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1677e4fb3e..8eb0cbedaf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -67,6 +67,8 @@ struct modified_path_bloom_filter_info {
 	uint64_t offset;
 	uint32_t merge_index_pos;
 	struct modified_path_bloom_filter_info *next;
+	/* TODO: need better names for 'next' and 'next_commit_bfi' */
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 };
 
 static void free_modified_path_bloom_filter_info_in_slab(
@@ -1159,6 +1161,10 @@ struct write_commit_graph_context {
 
 		/* Used to find identical modified path Bloom filters */
 		struct hashmap dedup_hashmap;
+
+		/* To chain up Bloom filters in history order */
+		struct modified_path_bloom_filter_info *first_commit_bfi;
+		struct modified_path_bloom_filter_info *prev_commit_bfi;
 	} mpbfctx;
 };
 
@@ -1363,18 +1369,18 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 		struct write_commit_graph_context *ctx)
 {
 	uint64_t offset = 0;
-	int i;
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 
-	for (i = 0; i < ctx->commits.nr; i++) {
-		struct commit *commit = ctx->commits.list[i];
-		struct modified_path_bloom_filter_info *bfi;
-		unsigned int filter_size;
+	next_commit_bfi = ctx->mpbfctx.first_commit_bfi;
+	while (next_commit_bfi) {
+		struct modified_path_bloom_filter_info *bfi = next_commit_bfi;
+
+		next_commit_bfi = next_commit_bfi->next_commit_bfi;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		bfi = modified_path_bloom_filters_peek(
-				&modified_path_bloom_filters, commit);
 		for (; bfi; bfi = bfi->next) {
+			unsigned int filter_size;
 			if (bfi->duplicate_of)
 				continue;
 			if (!bfi->filter.nr_bits)
@@ -1762,6 +1768,14 @@ static void create_modified_path_bloom_filter(
 	bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
 			      mpbfctx->hashes_nr);
 
+	if (!mpbfctx->first_commit_bfi) {
+		mpbfctx->first_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	} else if (!nth_parent) {
+		mpbfctx->prev_commit_bfi->next_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	}
+
 	if (path_component_count > mpbfctx->embedded_limit &&
 	    !handle_duplicate_modified_path_bloom_filter(mpbfctx, bfi))
 		mpbfctx->total_filter_size += sizeof(uint32_t) +

From patchwork Fri May 29 08:50:38 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
X-Patchwork-Id: 11578171
Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BD244739
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:52:00 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id A3DEF207BC
	for <patchwork-git@patchwork.kernel.org>;
 Fri, 29 May 2020 08:52:00 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="KM1+DVSE"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727037AbgE2Iv6 (ORCPT
        <rfc822;patchwork-git@patchwork.kernel.org>);
        Fri, 29 May 2020 04:51:58 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44498 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1727018AbgE2Ivv (ORCPT <rfc822;git@vger.kernel.org>);
        Fri, 29 May 2020 04:51:51 -0400
Received: from mail-wm1-x343.google.com (mail-wm1-x343.google.com
 [IPv6:2a00:1450:4864:20::343])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 558A8C08C5C8
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:51 -0700 (PDT)
Received: by mail-wm1-x343.google.com with SMTP id v19so2340592wmj.0
        for <git@vger.kernel.org>; Fri, 29 May 2020 01:51:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=utUv9LvPID6PTzOb9lGP8mk98660ET39XQtnFj1wxP8=;
        b=KM1+DVSEp/fzm4gIx8zdsY34+bUsnfP1aKv6TDxW6xrNVNceF/hFEhPmU5znhsrtSm
         cWuqSdQKwgHKwlMs2vt0WF3fL/U6i2+6tKCiEhfyPevkgItGrX2Dfzh7Ro8Bd5hwfhHm
         HAyZUNLVlnF7WjZWzlJL5W7XwYvb2nv9bR/9EPbnMhHMgLuQ7bQCH/seYdlmhHd4MHBU
         ZP9T8yt62oBmHeIsVo4s/D2cctKF2BdSyneFDiUy77sh2Z9F+w5V5AQC48XXo7aQ1Zmy
         dZ0u0Zoxmi8BPodkEuJKoDUlHMHuzZHouJpUySjDjBFM6DdtNtXIJko5VZISCSF+EW3Q
         vOpw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=utUv9LvPID6PTzOb9lGP8mk98660ET39XQtnFj1wxP8=;
        b=NiHupPUXEAn9MpIxFiU2t7poZgALvgQ5+NVs9BqC3Ycnt+0At+r3wbBG/Am7GJpVjC
         rdc+X+dWjfTi8PRgstnlB43i1BRwwGLQkYGz6f+bXnVJOjxQkLYvGVVplCbSJ6CdkgK3
         ez0l0eJo2cdlP7t3Fsb7OLS0BbbPGr9PH9sWd2FOhMDmn5yeKct4kmyzOl+Nk1cPQYip
         VNewEpy63gOcss0E9Piepo3Yr0n99wZ5vW8aMHfUhpj3eyNJrriT89xMgQkzaMhV9Ca0
         W4SleRl0YgwoNGcpUWv3zeTTEpfDatV3cYkLGKv2r5xrCkcqC3u74mqxBPfOayI0h+YG
         OfUA==
X-Gm-Message-State: AOAM533rZS/8UxmrTs40D0iHu0ExuKB4272Q3HtW2rOm8DJIoMbuosVt
        HGXAuw8whaE9e8kJDKazzCWnkiuo
X-Google-Smtp-Source: 
 ABdhPJyNDNiF/IhYWKlaaBlRf9wJuH3+TJe7JcywpC69eRrExfk8FfX2T8BEjXgIEf+yCq6kN5c4Bw==
X-Received: by 2002:a7b:c18a:: with SMTP id y10mr7917936wmi.73.1590742309726;
        Fri, 29 May 2020 01:51:49 -0700 (PDT)
Received: from localhost.localdomain (84-236-109-105.pool.digikabel.hu.
 [84.236.109.105])
        by smtp.gmail.com with ESMTPSA id
 f11sm3525255wrj.2.2020.05.29.01.51.48
        (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 29 May 2020 01:51:49 -0700 (PDT)
From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PATCH 34/34] commit-graph: use modified path Bloom filters with
 wildcards, if possible
Date: Fri, 29 May 2020 10:50:38 +0200
Message-Id: <20200529085038.26008-35-szeder.dev@gmail.com>
X-Mailer: git-send-email 2.27.0.rc1.431.g5c813f95dc
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>
References: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID: <git.vger.kernel.org>
X-Mailing-List: git@vger.kernel.org

Modified path Bloom filter don't store the names of modified paths,
they only set a couple of bits based on those paths' hashes.
Consequently, they can only be used when looking for the history of a
concrete path, so we disabled them when looking for pathspecs with
wildcards.

However, if the pathspec has "wildcard-less" leading directories, then
we can use modified path Bloom filters to skip commits that don't
modify those leading directories.  As a result, something like:

  git -c core.modifiedPathBloomFilters=1 rev-list HEAD -- 'compat/win32/pthread.*'

will take only ~0.045s instead of ~1.24s, achieving over 27x speedup.

For comparison, letting the shell do the wildcard matching, i.e. the
equivalent of using the pathspecs 'compat/win32/pthread.c' and
'compat/win32/pthread.h' takes ~0.311s without using modified path
Bloom filters: apparently tree-diff with wildcards can be considerably
more expensive that without wildcards, even if the wildcard is in the
last path component and that directory only contains a dozen files.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8eb0cbedaf..db43877426 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1023,9 +1023,9 @@ static void compute_modified_path_bloom_hashes_for_path(const char *path,
 void init_pathspec_bloom_fields(struct repository *r,
 				struct pathspec *pathspec)
 {
-	const unsigned bloom_compatible_magic = PATHSPEC_LITERAL;
+	const unsigned bloom_compatible_magic = PATHSPEC_LITERAL | PATHSPEC_GLOB;
 	struct commit_graph *graph = r->objects->commit_graph;
-	int i;
+	int i, can_use_modified_path_bloom_filters;
 
 	if (!graph)
 		return;
@@ -1033,15 +1033,14 @@ void init_pathspec_bloom_fields(struct repository *r,
 		return;
 	if (!pathspec->nr)
 		return;
-	if (pathspec->has_wildcard)
-		return;
 	if (pathspec->magic & ~bloom_compatible_magic)
 		return;
 
+	can_use_modified_path_bloom_filters = 1;
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item *pi = &pathspec->items[i];
 		const char *path = pi->match, *p;
-		size_t len = pi->len;
+		size_t nowildcard_len = pi->nowildcard_len;
 		int path_component_nr = 0, j;
 		uint32_t *hashes;
 		struct bloom_filter embedded_bf;
@@ -1051,14 +1050,29 @@ void init_pathspec_bloom_fields(struct repository *r,
 		 * slashes, but a trailing slash might still be present,
 		 * "remove" it.
 		 */
-		if (path[len - 1] == '/')
-			len--;
+		if (path[nowildcard_len - 1] == '/')
+			nowildcard_len--;
 
 		p = path;
 		do {
 			p = strchrnul(p + 1, '/');
-			path_component_nr++;
-		} while (p - path < len);
+			if (p - path <= nowildcard_len)
+				path_component_nr++;
+		} while (p - path < nowildcard_len);
+		/*
+		 * If a pathspec uses wildcards but has wildcard-less
+		 * leading directories, then we can use modified path Bloom
+		 * filters to skip commits that don't modify those leading
+		 * directories.
+		 * However, if there is even one pathspec that has a wilcard
+		 * in its first path component, then we have no choice but
+		 * to run tree-diff anyway, so don't bother with Bloom
+		 * filters at all in that case.
+		 */
+		if (!path_component_nr) {
+			can_use_modified_path_bloom_filters = 0;
+			break;
+		}
 
 		pi->modified_path_bloom_hashes_nr = path_component_nr * graph->num_modified_path_bloom_hashes;
 		ALLOC_ARRAY(pi->modified_path_bloom_hashes,
@@ -1084,7 +1098,17 @@ void init_pathspec_bloom_fields(struct repository *r,
 				      pi->modified_path_bloom_hashes_nr);
 	}
 
-	pathspec->can_use_modified_path_bloom_filters = 1;
+	if (can_use_modified_path_bloom_filters) {
+		pathspec->can_use_modified_path_bloom_filters = 1;
+	} else {
+		int j;
+		for (j = 0; j < i; j++) {
+			struct pathspec_item *pi = &pathspec->items[j];
+			FREE_AND_NULL(pi->modified_path_bloom_hashes);
+			pi->modified_path_bloom_hashes_nr = 0;
+			pi->modified_path_bloom_mask = 0;
+		}
+	}
 }
 
 struct packed_commit_list {