From patchwork Sat Sep 3 00:36:21 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shaoxuan Yuan
X-Patchwork-Id: 12964863
From: Shaoxuan Yuan
To: git@vger.kernel.org
Cc: derrickstolee@github.com, vdye@github.com, gitster@pobox.com, Shaoxuan Yuan
Subject: [PATCH v4 1/3] builtin/grep.c: add --sparse option
Date: Fri, 2 Sep 2022 17:36:21 -0700
Message-Id: <20220903003623.64750-2-shaoxuan.yuan02@gmail.com>
X-Mailer: git-send-email 2.37.0
In-Reply-To: <20220903003623.64750-1-shaoxuan.yuan02@gmail.com>
References: <20220817075633.217934-1-shaoxuan.yuan02@gmail.com>
 <20220903003623.64750-1-shaoxuan.yuan02@gmail.com>
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

Add a --sparse option to `git-grep`.

When the '--cached' option is used with the 'git grep' command, the
search is limited to the blobs found in the index, not in the worktree.
If the user has enabled sparse-checkout, this might present more results
than they would like, since the files outside of the sparse-checkout are
unlikely to be important to them.

Change the default behavior of 'git grep' to focus on the files within
the sparse-checkout definition. To restore the previous behavior, add a
'--sparse' option to 'git grep' that, when paired with the '--cached'
option, also inspects paths outside of the sparse-checkout definition.

Suggested-by: Victoria Dye
Helped-by: Derrick Stolee
Helped-by: Victoria Dye
Signed-off-by: Shaoxuan Yuan
---
 Documentation/git-grep.txt      |  5 ++++-
 builtin/grep.c                  | 10 +++++++++-
 t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------
 3 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 58d944bd57..bdd3d5b8a6 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -28,7 +28,7 @@ SYNOPSIS
 	   [-f <file>] [-e] <pattern>
 	   [--and|--or|--not|(|)|-e <pattern>...]
 	   [--recurse-submodules] [--parent-basename <basename>]
-	   [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | <tree>...]
+	   [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | <tree>...]
 	   [--] [<pathspec>...]
 
 DESCRIPTION
@@ -45,6 +45,9 @@ OPTIONS
 	Instead of searching tracked files in the working tree, search
 	blobs registered in the index file.
 
+--sparse::
+	Use with --cached. Search outside of sparse-checkout definition.
+
 --no-index::
 	Search files in the current directory that is not managed by Git.
 
diff --git a/builtin/grep.c b/builtin/grep.c
index e6bcdf860c..12abd832fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -96,6 +96,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int grep_sparse = 0;
+
 static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
 	if (opt->binary != GREP_BINARY_TEXT)
@@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (!cached && ce_skip_worktree(ce))
+		/*
+		 * Skip entries with SKIP_WORKTREE unless both --sparse and
+		 * --cached are given.
+		 */
+		if (!(grep_sparse && cached) && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_INTEGER('m', "max-count", &opt.max_count,
 			N_("maximum number of results per file")),
+		OPT_BOOL(0, "sparse", &grep_sparse,
+			 N_("search the contents of files outside the sparse-checkout definition")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index eb59564565..a9879cc980 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
 	dir/c:text
 	EOF
-	git grep --cached "text" >actual &&
+	git grep --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' '
+test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th
 	sub/B/b:text
 	sub2/a:text
 	EOF
-	git grep --recurse-submodules --cached "text" >actual &&
+	git grep --recurse-submodules --cached --sparse "text" >actual &&
 	test_cmp expect actual
 '
 
@@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a
 	test_cmp expect actual
 '
 
-test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' '
+test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	test_when_finished "git update-index --no-assume-unchanged b" &&
+	git update-index --assume-unchanged b &&
+	git grep --cached text >actual &&
+	test_cmp expect actual &&
+
 	cat >expect <<-EOF &&
 	a:text
 	b:text
@@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and
 	EOF
 	test_when_finished "git update-index --no-assume-unchanged b" &&
 	git update-index --assume-unchanged b &&
-	git grep --cached text >actual &&
+	git grep --cached --sparse text >actual &&
 	test_cmp expect actual
 '
 

From patchwork Sat Sep 3 00:36:22 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shaoxuan Yuan
X-Patchwork-Id: 12964864
From: Shaoxuan Yuan
To: git@vger.kernel.org
Cc: derrickstolee@github.com, vdye@github.com, gitster@pobox.com, Shaoxuan Yuan
Subject: [PATCH v4 2/3] builtin/grep.c: integrate with sparse index
Date: Fri, 2 Sep 2022 17:36:22 -0700
Message-Id: <20220903003623.64750-3-shaoxuan.yuan02@gmail.com>
X-Mailer: git-send-email 2.37.0
In-Reply-To: <20220903003623.64750-1-shaoxuan.yuan02@gmail.com>
References: <20220817075633.217934-1-shaoxuan.yuan02@gmail.com>
 <20220903003623.64750-1-shaoxuan.yuan02@gmail.com>
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

Turn on sparse index for `git grep` and stop calling ensure_full_index()
unconditionally. Expand the index only when --sparse is used.

The p2000 tests do not demonstrate a significant improvement, because
the index read is a small portion of the full process time, compared to
the blob parsing. The times below reflect the time spent in the
"do_read_index" trace region as shown using GIT_TRACE2_PERF=1.

Within that region, the measurements show a ~99.4% reduction in index
read time for `git grep` when using a sparse index.

Test                                  HEAD~   HEAD
-----------------------------------------------------------------------------
git grep --cached bogus (full-v3)     0.019   0.018  (-5.2%)
git grep --cached bogus (full-v4)     0.017   0.016  (-5.8%)
git grep --cached bogus (sparse-v3)   0.29    0.0015 (-99.4%)
git grep --cached bogus (sparse-v4)   0.30    0.0018 (-99.4%)

Optional reading about performance test results
-----------------------------------------------
Notice that because `git-grep` needs to parse blobs in the index, the
index reading time is minuscule compared to the object parsing time.
Because of this, the p2000 test results cannot clearly reflect the
speedup for index reading: combined with the object parsing time, the
aggregated times for HEAD~1 and HEAD are extremely close.

Hence, the results presented here are not directly extracted from the
p2000 test results. Instead, to make the performance difference more
visible, the test command was manually run with GIT_TRACE2_PERF in the
four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
are then extracted from the time difference between "region_enter" and
"region_leave" of label "do_read_index".

Helped-by: Victoria Dye
Helped-by: Derrick Stolee
Signed-off-by: Shaoxuan Yuan
---
 builtin/grep.c                           | 10 ++++++++--
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 12abd832fa..a0b4dbc1dc 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,8 +522,9 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	/* TODO: audit for interaction with sparse-index. */
-	ensure_full_index(repo->index);
+	if (grep_sparse)
+		ensure_full_index(repo->index);
+
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -992,6 +993,11 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_KEEP_DASHDASH |
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (the_repository->gitdir) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
+
 	if (use_index && !startup_info->have_repository) {
 		int fallback = 0;
 		git_config_get_bool("grep.fallbacktonoindex", &fallback);
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 0302e36fd6..63becc3138 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1972,4 +1972,22 @@ test_expect_success 'sparse index is not expanded: rm' '
 	ensure_not_expanded rm -r deep
 '
 
+test_expect_success 'grep with --sparse and --cached' '
+	init_repos &&
+
+	test_all_match git grep --sparse --cached a &&
+	test_all_match git grep --sparse --cached a -- "folder1/*"
+'
+
+test_expect_success 'grep is not expanded' '
+	init_repos &&
+
+	ensure_not_expanded grep a &&
+	ensure_not_expanded grep a -- deep/* &&
+
+	# All files within the folder1/* pathspec are sparse,
+	# so this command does not find any matches
+	ensure_not_expanded ! grep a -- folder1/*
+'
+
 test_done

From patchwork Sat Sep 3 00:36:23 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Shaoxuan Yuan
X-Patchwork-Id: 12964865
From: Shaoxuan Yuan
To: git@vger.kernel.org
Cc: derrickstolee@github.com, vdye@github.com, gitster@pobox.com, Shaoxuan Yuan
Subject: [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
Date: Fri, 2 Sep 2022 17:36:23 -0700
Message-Id: <20220903003623.64750-4-shaoxuan.yuan02@gmail.com>
X-Mailer: git-send-email 2.37.0
In-Reply-To: <20220903003623.64750-1-shaoxuan.yuan02@gmail.com>
References: <20220817075633.217934-1-shaoxuan.yuan02@gmail.com>
 <20220903003623.64750-1-shaoxuan.yuan02@gmail.com>
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

Before this patch, whenever --sparse is used, `git-grep` utilizes the
ensure_full_index() method to expand the index and search all the
entries.
Because this method requires walking all the trees and constructing the
index, it is the slow part within the whole command.

To achieve better performance, this patch uses grep_tree() to search the
sparse directory entries and gets rid of the ensure_full_index() call.

Why is grep_tree() a better choice than ensure_full_index()?

1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
   into every sparse-directory entry (represented by a tree) recursively
   when looping over the index, and the result of doing so matches the
   result of expanding the index.

2) grep_tree() utilizes pathspecs to limit the scope of searching.
   ensure_full_index() always expands the index when --sparse is used,
   which means it will always walk all the trees and blobs in the repo
   without caring if the user only wants a subset of the content, i.e.
   using a pathspec. On the other hand, grep_tree() will only search
   the contents that match the pathspec, and thus possibly walks fewer
   trees.

3) grep_tree() does not construct and copy back a new index, while
   ensure_full_index() does. This also saves some time.

----------------
Performance test

- Summary:

p2000 tests demonstrate a ~71% execution time reduction for
`git grep --cached --sparse bogus -- "f2/f1/f1/*"` using tree-walking
logic. However, notice that this result varies depending on the pathspec
given. See below "Command used for testing" for more details.

Test                               HEAD~   HEAD
-------------------------------------------------------
2000.78: git grep ... (full-v3)    0.35    0.39 (≈)
2000.79: git grep ... (full-v4)    0.36    0.30 (≈)
2000.80: git grep ... (sparse-v3)  0.88    0.23 (-73.8%)
2000.81: git grep ... (sparse-v4)  0.83    0.26 (-68.6%)

- Command used for testing:

	git grep --cached --sparse bogus -- "f2/f1/f1/*"

The reason for specifying a pathspec is that, if we don't specify a
pathspec, then grep_tree() will walk all the trees and blobs to find the
pattern, and the time consumed doing so is not too different from using
the original ensure_full_index() method, which also spends most of the
time walking trees. However, when a pathspec is specified, this latest
logic will only walk the area of trees enclosed by the pathspec, and the
time consumed is considerably less.

Generally speaking, because the performance gain is achieved by walking
fewer trees, as limited by the pathspec, the ratio of HEAD time to HEAD~
time in sparse-v[3|4] should be proportional to the ratio of "pathspec
enclosed area" to "all area". Namely, the more of the tree the pathspec
covers, the smaller the performance difference between HEAD~ and HEAD,
and vice versa.

That is, if we don't specify a pathspec, the performance difference [1]
is indistinguishable: both methods walk all the trees and take roughly
the same amount of time (even with the index construction time included
for ensure_full_index()).

[1] Performance test result without pathspec (hence walking all trees):

	Command used:

		git grep --cached --sparse bogus

	Test                               HEAD~   HEAD
	---------------------------------------------------
	2000.78: git grep ... (full-v3)    6.17    5.19 (≈)
	2000.79: git grep ... (full-v4)    6.19    5.46 (≈)
	2000.80: git grep ... (sparse-v3)  6.57    6.44 (≈)
	2000.81: git grep ... (sparse-v4)  6.65    6.28 (≈)

Suggested-by: Derrick Stolee
Helped-by: Derrick Stolee
Helped-by: Victoria Dye
Signed-off-by: Shaoxuan Yuan
---
 builtin/grep.c                           | 17 +++++++++++++----
 t/perf/p2000-sparse-operations.sh        |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 10 +++++++++-
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index a0b4dbc1dc..d8c086abff 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
-	if (grep_sparse)
-		ensure_full_index(repo->index);
-
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
@@ -537,8 +534,20 @@ static int grep_cache(struct grep_opt *opt,
 
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
+		if (S_ISSPARSEDIR(ce->ce_mode)) {
+			enum object_type type;
+			struct tree_desc tree;
+			void *data;
+			unsigned long size;
+
+			data = read_object_file(&ce->oid, &type, &size);
+			init_tree_desc(&tree, data, size);
 
-		if (S_ISREG(ce->ce_mode) &&
+			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
+			strbuf_reset(&name);
+			strbuf_addstr(&name, ce->name);
+			free(data);
+		} else if (S_ISREG(ce->ce_mode) &&
 		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
 				   S_ISDIR(ce->ce_mode) ||
 				   S_ISGITLINK(ce->ce_mode))) {
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index fce8151d41..3242cfe91a 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD
 test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
+test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/*"
 
 test_done
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 63becc3138..56e4614276 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1987,7 +1987,15 @@ test_expect_success 'grep is not expanded' '
 
 	# All files within the folder1/* pathspec are sparse,
 	# so this command does not find any matches
-	ensure_not_expanded ! grep a -- folder1/*
+	ensure_not_expanded ! grep a -- folder1/* &&
+
+	# test out-of-cone pathspec with or without wildcard
+	ensure_not_expanded grep --sparse --cached a -- "folder1/a" &&
+	ensure_not_expanded grep --sparse --cached a -- "folder1/*" &&
+
+	# test in-cone pathspec with or without wildcard
+	ensure_not_expanded grep --sparse --cached a -- "deep/a" &&
+	ensure_not_expanded grep --sparse --cached a -- "deep/*"
 '
 
 test_done