From patchwork Thu Sep 8 00:18:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shaoxuan Yuan X-Patchwork-Id: 12969495 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 88C33C54EE9 for ; Thu, 8 Sep 2022 00:19:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230173AbiIHATu (ORCPT ); Wed, 7 Sep 2022 20:19:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230130AbiIHATo (ORCPT ); Wed, 7 Sep 2022 20:19:44 -0400 Received: from mail-oa1-x2a.google.com (mail-oa1-x2a.google.com [IPv6:2001:4860:4864:20::2a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 241832C666 for ; Wed, 7 Sep 2022 17:19:43 -0700 (PDT) Received: by mail-oa1-x2a.google.com with SMTP id 586e51a60fabf-1278a61bd57so21647764fac.7 for ; Wed, 07 Sep 2022 17:19:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=jOtJ79aFd8RCKOpzHr11un7LMCyqar9A66DYrIrSupY=; b=BR7Q4fNgT2aiFTZDfZFALMOIJ+SzCEqELwAMExedUYYrtU6KE/1QhhPlzkl/nBE17C pr1HCFR+Zbd50I4bjHIRpECd04HWpyXetBGIN9JTlVvlxUmizjXbj5dNbVPGT59LRyhL CACVQ1fIEb/S1vTyl/1e0EPEXHhbGeYCNbChVMgYA8D0vOWlxlerneBuVhcB3/5FjJI6 JYbJxzHBeCg1YEzyBuHaTRxLbKtID99UQ40Zz+fQsW2llfBqZZ/e0fXKDoLUDXYs9MnQ Q+wIpczr+D/SVXMVYlLPZYPccR9dB1y4d/IuExoGDoRMxWmT2adb3qXkKCz00QJXJV0x pOxA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=jOtJ79aFd8RCKOpzHr11un7LMCyqar9A66DYrIrSupY=; b=AuMiX4OKgQeJbIN5AHB4n5cF05Q5mpmf6dSMeibLAgNx783jNC43t3ac7XRBvsYTF7 OAUub8zAvdcZuk7AwW9P4clrKxRHPI6cZWrttQXdt3nWQOv+vYagXU82280mQJOyTXci SPjTRX67c92bo9LKK/oXDqhnihZjdo09zXMcl2wD/kjHuazly2XcgJU+57gAATfr040N SVTnYzQiZ2PFkVtvCudu57xa/CISpZmSyjO8BVKbfjkaHJ6q/KCaTAZv3Cf9bNeq7vvU g6ie1/XZDrM5qL7LG3L9oxsnbot+z+MOnidDqqybKsG/ljjw81g6mrwunUrH4umdZoAp r9vw== X-Gm-Message-State: ACgBeo3VxfPKaFWzBv130PXSYwDWXav/5QrsmPwXwyx68gI6Ydt86nEO JSsa1kWpqjdH0MSeiIjFByw= X-Google-Smtp-Source: AA6agR4Q3fkNRwQbB5CDTCroHq4aqemlv9DM8t9J+MMVKDwODvL5jWAnXGD62tfDmYR+w0eTQo74ZA== X-Received: by 2002:a05:6870:3285:b0:11f:5465:4943 with SMTP id q5-20020a056870328500b0011f54654943mr534691oac.188.1662596382442; Wed, 07 Sep 2022 17:19:42 -0700 (PDT) Received: from ffyuanda.localdomain (99-110-131-145.lightspeed.irvnca.sbcglobal.net. [99.110.131.145]) by smtp.gmail.com with ESMTPSA id j95-20020a9d17e8000000b00638dd127f54sm7662091otj.1.2022.09.07.17.19.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Sep 2022 17:19:42 -0700 (PDT) From: Shaoxuan Yuan To: shaoxuan.yuan02@gmail.com Cc: derrickstolee@github.com, vdye@github.com, git@vger.kernel.org, gitster@pobox.com Subject: [PATCH v5 1/3] builtin/grep.c: add --sparse option Date: Wed, 7 Sep 2022 17:18:52 -0700 Message-Id: <20220908001854.206789-2-shaoxuan.yuan02@gmail.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220908001854.206789-1-shaoxuan.yuan02@gmail.com> References: <20220817075633.217934-1-shaoxuan.yuan02@gmail.com> <20220908001854.206789-1-shaoxuan.yuan02@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Add a --sparse option to `git-grep`. When the '--cached' option is used with the 'git grep' command, the search is limited to the blobs found in the index, not in the worktree. If the user has enabled sparse-checkout, this might present more results than they would like, since the files outside of the sparse-checkout are unlikely to be important to them. Change the default behavior of 'git grep' to focus on the files within the sparse-checkout definition. To enable the previous behavior, add a '--sparse' option to 'git grep' that triggers the old behavior that inspects paths outside of the sparse-checkout definition when paired with the '--cached' option. Suggested-by: Victoria Dye Helped-by: Derrick Stolee Helped-by: Victoria Dye Signed-off-by: Shaoxuan Yuan --- Documentation/git-grep.txt | 5 ++++- builtin/grep.c | 10 +++++++++- t/t7817-grep-sparse-checkout.sh | 34 +++++++++++++++++++++++++++------ 3 files changed, 41 insertions(+), 8 deletions(-) diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt index 58d944bd57..bdd3d5b8a6 100644 --- a/Documentation/git-grep.txt +++ b/Documentation/git-grep.txt @@ -28,7 +28,7 @@ SYNOPSIS [-f ] [-e] [--and|--or|--not|(|)|-e ...] [--recurse-submodules] [--parent-basename ] - [ [--[no-]exclude-standard] [--cached | --no-index | --untracked] | ...] + [ [--[no-]exclude-standard] [--cached [--sparse] | --no-index | --untracked] | ...] [--] [...] DESCRIPTION @@ -45,6 +45,9 @@ OPTIONS Instead of searching tracked files in the working tree, search blobs registered in the index file. +--sparse:: + Use with --cached. Search outside of sparse-checkout definition. + --no-index:: Search files in the current directory that is not managed by Git. diff --git a/builtin/grep.c b/builtin/grep.c index e6bcdf860c..12abd832fa 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -96,6 +96,8 @@ static pthread_cond_t cond_result; static int skip_first_line; +static int grep_sparse = 0; + static void add_work(struct grep_opt *opt, struct grep_source *gs) { if (opt->binary != GREP_BINARY_TEXT) @@ -525,7 +527,11 @@ static int grep_cache(struct grep_opt *opt, for (nr = 0; nr < repo->index->cache_nr; nr++) { const struct cache_entry *ce = repo->index->cache[nr]; - if (!cached && ce_skip_worktree(ce)) + /* + * Skip entries with SKIP_WORKTREE unless both --sparse and + * --cached are given. + */ + if (!(grep_sparse && cached) && ce_skip_worktree(ce)) continue; strbuf_setlen(&name, name_base_len); @@ -963,6 +969,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix) PARSE_OPT_NOCOMPLETE), OPT_INTEGER('m', "max-count", &opt.max_count, N_("maximum number of results per file")), + OPT_BOOL(0, "sparse", &grep_sparse, + N_("search the contents of files outside the sparse-checkout definition")), OPT_END() }; grep_prefix = prefix; diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh index eb59564565..a9879cc980 100755 --- a/t/t7817-grep-sparse-checkout.sh +++ b/t/t7817-grep-sparse-checkout.sh @@ -118,13 +118,19 @@ test_expect_success 'grep searches unmerged file despite not matching sparsity p test_cmp expect actual ' -test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit' ' +test_expect_success 'grep --cached and --sparse searches entries with the SKIP_WORKTREE bit' ' + cat >expect <<-EOF && + a:text + EOF + git grep --cached "text" >actual && + test_cmp expect actual && + cat >expect <<-EOF && a:text b:text dir/c:text EOF - git grep --cached "text" >actual && + git grep --cached --sparse "text" >actual && test_cmp expect actual ' @@ -143,7 +149,15 @@ test_expect_success 'grep --recurse-submodules honors sparse checkout in submodu test_cmp expect actual ' -test_expect_success 'grep --recurse-submodules --cached searches entries with the SKIP_WORKTREE bit' ' +test_expect_success 'grep --recurse-submodules --cached and --sparse searches entries with the SKIP_WORKTREE bit' ' + cat >expect <<-EOF && + a:text + sub/B/b:text + sub2/a:text + EOF + git grep --recurse-submodules --cached "text" >actual && + test_cmp expect actual && + cat >expect <<-EOF && a:text b:text @@ -152,7 +166,7 @@ test_expect_success 'grep --recurse-submodules --cached searches entries with th sub/B/b:text sub2/a:text EOF - git grep --recurse-submodules --cached "text" >actual && + git grep --recurse-submodules --cached --sparse "text" >actual && test_cmp expect actual ' @@ -166,7 +180,15 @@ test_expect_success 'working tree grep does not search the index with CE_VALID a test_cmp expect actual ' -test_expect_success 'grep --cached searches index entries with both CE_VALID and SKIP_WORKTREE' ' +test_expect_success 'grep --cached and --sparse searches index entries with both CE_VALID and SKIP_WORKTREE' ' + cat >expect <<-EOF && + a:text + EOF + test_when_finished "git update-index --no-assume-unchanged b" && + git update-index --assume-unchanged b && + git grep --cached text >actual && + test_cmp expect actual && + cat >expect <<-EOF && a:text b:text @@ -174,7 +196,7 @@ test_expect_success 'grep --cached searches index entries with both CE_VALID and EOF test_when_finished "git update-index --no-assume-unchanged b" && git update-index --assume-unchanged b && - git grep --cached text >actual && + git grep --cached --sparse text >actual && test_cmp expect actual ' From patchwork Thu Sep 8 00:18:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shaoxuan Yuan X-Patchwork-Id: 12969496 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5D540C38145 for ; Thu, 8 Sep 2022 00:19:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230181AbiIHATx (ORCPT ); Wed, 7 Sep 2022 20:19:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53370 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230140AbiIHATo (ORCPT ); Wed, 7 Sep 2022 20:19:44 -0400 Received: from mail-oa1-x2f.google.com (mail-oa1-x2f.google.com [IPv6:2001:4860:4864:20::2f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E25BE2CDC7 for ; Wed, 7 Sep 2022 17:19:43 -0700 (PDT) Received: by mail-oa1-x2f.google.com with SMTP id 586e51a60fabf-127ba06d03fso16618962fac.3 for ; Wed, 07 Sep 2022 17:19:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=QwxiLJLNI7zY6Q83FGkS3RkpwPiqHH+qBSBQqh0KA3U=; b=O5HUIM3o6G/qEx1cIWKfxDUIF3eg7yv4xL0BEtyc2s9pWqgJgnzllG0lLI8h+NLf4B SRIcAbgNZKCrJWgl5gd9Jd57Riz2PL3GOGQQ6YwSTdEYyzyUmvimjoFr8cc4Au15/PaN CNu/HMrKY827HHNhhbWYPtmLvv+3CCELuFmi5pTXmV2LnpVRPjoC6jb2FHz6j7OQIhnV E6DqQgwoe3i5wIv94BvrIjhS+wm62s+s+Pgylb7cPUZPL52lamukLvDhVF1ScjrBjL9F XiBatPQr53hFCZ90VtCAMk6HgjOCdYzOLlAsCC3zbzFKz0W8RMsN4uOQw8AF9xr6zvmh cjyw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=QwxiLJLNI7zY6Q83FGkS3RkpwPiqHH+qBSBQqh0KA3U=; b=bvbI6L6kPfuSxjqzmnA9VSy2lBRNXhBBfcrRHOSBAt4rVFtU68mo8k4v5+N3em0jDU w+Tv+YMw5xs4H22exrOLZCWX45u5g1fJGLyXNxD3jfajh47gi5rVkUqgVLwK70h/8y64 KT9Ls6jAHO6ughtFLRpJhhbemwP2LQRvn6ryfV3PwB+d8lUXC3GsgO+E8he0nJ8sWHm9 OXPIJ/zozxB57QjOwHJZPYYsn2uynjdWRGHIjPX/wimmJ0aP2wi/KPFv0WkUGG7SEFsl mCYe38WN/PLUOECNo5SR2X80ArSJifmh8eZwJShhqQ4AjR3RVbiEWrXQzvm5o7tOWqks VifQ== X-Gm-Message-State: ACgBeo0utJPYjj1yXgC8Li6qnJGJY4ZHLoNjhy0Kh+l5JUpK284I8gSh H/AhC5wu67rFi2F/P250+QY= X-Google-Smtp-Source: AA6agR5oT3hyfvOzeXUWC/0Mij9VXsymP1ycW5hA/IVXXIBM1elPNCGYGbjIoulrNRCC5ldvmCQRtQ== X-Received: by 2002:a05:6870:f14a:b0:127:9b8d:7ef3 with SMTP id l10-20020a056870f14a00b001279b8d7ef3mr506501oac.200.1662596383311; Wed, 07 Sep 2022 17:19:43 -0700 (PDT) Received: from ffyuanda.localdomain (99-110-131-145.lightspeed.irvnca.sbcglobal.net. [99.110.131.145]) by smtp.gmail.com with ESMTPSA id j95-20020a9d17e8000000b00638dd127f54sm7662091otj.1.2022.09.07.17.19.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Sep 2022 17:19:43 -0700 (PDT) From: Shaoxuan Yuan To: shaoxuan.yuan02@gmail.com Cc: derrickstolee@github.com, vdye@github.com, git@vger.kernel.org, gitster@pobox.com Subject: [PATCH v5 2/3] builtin/grep.c: integrate with sparse index Date: Wed, 7 Sep 2022 17:18:53 -0700 Message-Id: <20220908001854.206789-3-shaoxuan.yuan02@gmail.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220908001854.206789-1-shaoxuan.yuan02@gmail.com> References: <20220817075633.217934-1-shaoxuan.yuan02@gmail.com> <20220908001854.206789-1-shaoxuan.yuan02@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Turn on sparse index and remove ensure_full_index(). Change it to only expand the index when using --sparse. The p2000 tests do not demonstrate a significant improvement, because the index read is a small portion of the full process time, compared to the blob parsing. The times below reflect the time spent in the "do_read_index" trace region as shown using GIT_TRACE2_PERF=1. The tests demonstrate a ~99.4% execution time reduction for `git grep` using a sparse index. Test HEAD~ HEAD ----------------------------------------------------------------------------- git grep --cached bogus (full-v3) 0.019 0.018 (-5.2%) git grep --cached bogus (full-v4) 0.017 0.016 (-5.8%) git grep --cached bogus (sparse-v3) 0.29 0.0015 (-99.4%) git grep --cached bogus (sparse-v4) 0.30 0.0018 (-99.4%) Optional reading about performance test results ----------------------------------------------- Notice that because `git-grep` needs to parse blobs in the index, the index reading time is minuscule comparing to the object parsing time. And because of this, the p2000 test results cannot clearly reflect the speedup for index reading: combining with the object parsing time, the aggregated time difference is extremely close between HEAD~1 and HEAD. Hence, the results presenting here are not directly extracted from the p2000 test results. Instead, to make the performance difference more visible, the test command is manually ran with GIT_TRACE2_PERF in the four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here are then extracted from the time difference between "region_enter" and "region_leave" of label "do_read_index". Helped-by: Victoria Dye Helped-by: Derrick Stolee Signed-off-by: Shaoxuan Yuan --- builtin/grep.c | 10 ++++++++-- t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++ 2 files changed, 26 insertions(+), 2 deletions(-) diff --git a/builtin/grep.c b/builtin/grep.c index 12abd832fa..a0b4dbc1dc 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -522,8 +522,9 @@ static int grep_cache(struct grep_opt *opt, if (repo_read_index(repo) < 0) die(_("index file corrupt")); - /* TODO: audit for interaction with sparse-index. */ - ensure_full_index(repo->index); + if (grep_sparse) + ensure_full_index(repo->index); + for (nr = 0; nr < repo->index->cache_nr; nr++) { const struct cache_entry *ce = repo->index->cache[nr]; @@ -992,6 +993,11 @@ int cmd_grep(int argc, const char **argv, const char *prefix) PARSE_OPT_KEEP_DASHDASH | PARSE_OPT_STOP_AT_NON_OPTION); + if (the_repository->gitdir) { + prepare_repo_settings(the_repository); + the_repository->settings.command_requires_full_index = 0; + } + if (use_index && !startup_info->have_repository) { int fallback = 0; git_config_get_bool("grep.fallbacktonoindex", &fallback); diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh index 0302e36fd6..63becc3138 100755 --- a/t/t1092-sparse-checkout-compatibility.sh +++ b/t/t1092-sparse-checkout-compatibility.sh @@ -1972,4 +1972,22 @@ test_expect_success 'sparse index is not expanded: rm' ' ensure_not_expanded rm -r deep ' +test_expect_success 'grep with --sparse and --cached' ' + init_repos && + + test_all_match git grep --sparse --cached a && + test_all_match git grep --sparse --cached a -- "folder1/*" +' + +test_expect_success 'grep is not expanded' ' + init_repos && + + ensure_not_expanded grep a && + ensure_not_expanded grep a -- deep/* && + + # All files within the folder1/* pathspec are sparse, + # so this command does not find any matches + ensure_not_expanded ! grep a -- folder1/* +' + test_done From patchwork Thu Sep 8 00:18:54 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Shaoxuan Yuan X-Patchwork-Id: 12969497 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 70095ECAAD3 for ; Thu, 8 Sep 2022 00:19:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230183AbiIHATy (ORCPT ); Wed, 7 Sep 2022 20:19:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53534 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230146AbiIHATq (ORCPT ); Wed, 7 Sep 2022 20:19:46 -0400 Received: from mail-oa1-x29.google.com (mail-oa1-x29.google.com [IPv6:2001:4860:4864:20::29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D8563201AB for ; Wed, 7 Sep 2022 17:19:44 -0700 (PDT) Received: by mail-oa1-x29.google.com with SMTP id 586e51a60fabf-127f5411b9cso8840123fac.4 for ; Wed, 07 Sep 2022 17:19:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=iEW+rUnfuhRbU3uJb7LwzvzKPayNq4UvQdxZWwi4tbI=; b=SOASra3YWIhJWicSg50IwotDaYXetVJc5JcsNX8X4UOG6Esv6f5wYzKjRypaEVRCUh 5mNFRM+ZlA39MKjw7WgmhiiXA7y0XmBhyhSp3Z/c8ldfEYWeYPAnk/KioJvVy/aPqhs6 8kCPzBidOiy0OuZS+uyH2LvyzE10a/kE/9QKnwWfz65ajvtJwMyZPI1onsLI+brqCfvl qEC/zB6saKL750svVZd8enHBPMrrMOuZK10zax+sddKED5ZFWjHamaHdJDCrtQcJ6PQV VlpQ1RuHCE3Kg9oWjzySgWK4GJWZ91Sw1J3qw0d84yA4t7eeEC8NGFrusmwzPCNkciMM qKJA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=iEW+rUnfuhRbU3uJb7LwzvzKPayNq4UvQdxZWwi4tbI=; b=nOSw6/IyBqw5bYDK+5Dss5NInYd494rU4TGDV4FvjO8k5AggmjF0VM+vZklAcj25HL OGk2ITS0Hhyz9kw4qmXDYb0YIH2hYimfCPdRXnCshpV3TOaX379cocQt7Rn7n6swIMyg uI7zjnqM/q6NYHdgwu38QrGsd5VcpCNvJULhxJJKfVchNi7HUuU537LzcRfQJGy3kbNI M2yegdrJR8A5Has/pnyRLvlGgLF9atPCdyZSkmjppuH4wERlF+MZj5lqMipfzIxJnqYN zirEd8lANcgwFRfKdv2h4gZlOjtAqpw9q78NBkoKc6JNxKuOgbKhyn2/vMuqQgKWAZDW 6tPw== X-Gm-Message-State: ACgBeo3Ipa2zSozV8+xE2ZqMcnvrJZyo64Z1MeyB8JNwxvsWTPphpGL+ 63pfvNu0yxqaZdaOpCstFxk= X-Google-Smtp-Source: AA6agR5wYT7qY9vWUzGoSGkjbzkOTHgeFoPhjnfqqOeablPLRhfUMedsXMqB9MmIeVXKrLhGIbBt7w== X-Received: by 2002:a05:6808:1407:b0:345:b9a3:6d6f with SMTP id w7-20020a056808140700b00345b9a36d6fmr390630oiv.9.1662596384162; Wed, 07 Sep 2022 17:19:44 -0700 (PDT) Received: from ffyuanda.localdomain (99-110-131-145.lightspeed.irvnca.sbcglobal.net. [99.110.131.145]) by smtp.gmail.com with ESMTPSA id j95-20020a9d17e8000000b00638dd127f54sm7662091otj.1.2022.09.07.17.19.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Sep 2022 17:19:43 -0700 (PDT) From: Shaoxuan Yuan To: shaoxuan.yuan02@gmail.com Cc: derrickstolee@github.com, vdye@github.com, git@vger.kernel.org, gitster@pobox.com Subject: [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Date: Wed, 7 Sep 2022 17:18:54 -0700 Message-Id: <20220908001854.206789-4-shaoxuan.yuan02@gmail.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220908001854.206789-1-shaoxuan.yuan02@gmail.com> References: <20220817075633.217934-1-shaoxuan.yuan02@gmail.com> <20220908001854.206789-1-shaoxuan.yuan02@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Before this patch, whenever --sparse is used, `git-grep` utilizes the ensure_full_index() method to expand the index and search all the entries. Because this method requires walking all the trees and constructing the index, it is the slow part within the whole command. To achieve better performance, this patch uses grep_tree() to search the sparse directory entries and get rid of the ensure_full_index() method. Why grep_tree() is a better choice over ensure_full_index()? 1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks into every sparse-directory entry (represented by a tree) recursively when looping over the index, and the result of doing so matches the result of expanding the index. 2) grep_tree() utilizes pathspecs to limit the scope of searching. ensure_full_index() always expands the index when --sparse is used, that means it will always walk all the trees and blobs in the repo without caring if the user only wants a subset of the content, i.e. using a pathspec. On the other hand, grep_tree() will only search the contents that match the pathspec, and thus possibly walking fewer trees. 3) grep_tree() does not construct and copy back a new index, while ensure_full_index() does. This also saves some time. ---------------- Performance test - Summary: p2000 tests demonstrate a ~71% execution time reduction for `git grep --cached --sparse bogus -- "f2/f1/f1/*"` using tree-walking logic. However, notice that this result varies depending on the pathspec given. See below "Command used for testing" for more details. Test HEAD~ HEAD ------------------------------------------------------- 2000.78: git grep ... (full-v3) 0.35 0.39 (≈) 2000.79: git grep ... (full-v4) 0.36 0.30 (≈) 2000.80: git grep ... (sparse-v3) 0.88 0.23 (-73.8%) 2000.81: git grep ... (sparse-v4) 0.83 0.26 (-68.6%) - Command used for testing: git grep --cached --sparse bogus -- "f2/f1/f1/*" The reason for specifying a pathspec is that, if we don't specify a pathspec, then grep_tree() will walk all the trees and blobs to find the pattern, and the time consumed doing so is not too different from using the original ensure_full_index() method, which also spends most of the time walking trees. However, when a pathspec is specified, this latest logic will only walk the area of trees enclosed by the pathspec, and the time consumed is reasonably a lot less. Generally speaking, because the performance gain is acheived by walking less trees, which are specified by the pathspec, the HEAD time v.s. HEAD~ time in sparse-v[3|4], should be proportional to "pathspec enclosed area" v.s. "all area", respectively. Namely, the wider the is encompassing, the less the performance difference between HEAD~ and HEAD, and vice versa. That is, if we don't specify a pathspec, the performance difference [1] is indistinguishable: both methods walk all the trees and take generally same amount of time (even with the index construction time included for ensure_full_index()). [1] Performance test result without pathspec (hence walking all trees): Command used: git grep --cached --sparse bogus Test HEAD~ HEAD --------------------------------------------------- 2000.78: git grep ... (full-v3) 6.17 5.19 (≈) 2000.79: git grep ... (full-v4) 6.19 5.46 (≈) 2000.80: git grep ... (sparse-v3) 6.57 6.44 (≈) 2000.81: git grep ... (sparse-v4) 6.65 6.28 (≈) -------------------------- NEEDSWORK about submodules There are a few NEEDSWORKs that belong to improvements beyond this topic. See the NEEDSWORK in builtin/grep.c::grep_submodule() for more context. The other two NEEDSWORKs in t1092 are also relative. Suggested-by: Derrick Stolee Helped-by: Derrick Stolee Helped-by: Victoria Dye Signed-off-by: Shaoxuan Yuan --- builtin/grep.c | 44 +++++++++++++++++-- t/perf/p2000-sparse-operations.sh | 1 + t/t1092-sparse-checkout-compatibility.sh | 56 +++++++++++++++++++++++- 3 files changed, 96 insertions(+), 5 deletions(-) diff --git a/builtin/grep.c b/builtin/grep.c index a0b4dbc1dc..9a01932253 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -460,6 +460,33 @@ static int grep_submodule(struct grep_opt *opt, * subrepo's odbs to the in-memory alternates list. */ obj_read_lock(); + + /* + * NEEDSWORK: when reading a submodule, the sparsity settings in the + * superproject are incorrectly forgotten or misused. For example: + * + * 1. "command_requires_full_index" + * When this setting is turned on for `grep`, only the superproject + * knows it. All the submodules are read with their own configs + * and get prepare_repo_settings()'d. Therefore, these submodules + * "forget" the sparse-index feature switch. As a result, the index + * of these submodules are expanded unexpectedly. + * + * 2. "core_apply_sparse_checkout" + * When running `grep` in the superproject, this setting is + * populated using the superproject's configs. However, once + * initialized, this config is globally accessible and is read by + * prepare_repo_settings() for the submodules. For instance, if a + * submodule is using a sparse-checkout, however, the superproject + * is not, the result is that the config from the superproject will + * dictate the behavior for the submodule, making it "forget" its + * sparse-checkout state. + * + * 3. "core_sparse_checkout_cone" + * ditto. + * + * Note that this list is not exhaustive. + */ repo_read_gitmodules(subrepo, 0); /* @@ -522,9 +549,6 @@ static int grep_cache(struct grep_opt *opt, if (repo_read_index(repo) < 0) die(_("index file corrupt")); - if (grep_sparse) - ensure_full_index(repo->index); - for (nr = 0; nr < repo->index->cache_nr; nr++) { const struct cache_entry *ce = repo->index->cache[nr]; @@ -537,8 +561,20 @@ static int grep_cache(struct grep_opt *opt, strbuf_setlen(&name, name_base_len); strbuf_addstr(&name, ce->name); + if (S_ISSPARSEDIR(ce->ce_mode)) { + enum object_type type; + struct tree_desc tree; + void *data; + unsigned long size; + + data = read_object_file(&ce->oid, &type, &size); + init_tree_desc(&tree, data, size); - if (S_ISREG(ce->ce_mode) && + hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0); + strbuf_setlen(&name, name_base_len); + strbuf_addstr(&name, ce->name); + free(data); + } else if (S_ISREG(ce->ce_mode) && match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL, S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode))) { diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh index fce8151d41..3242cfe91a 100755 --- a/t/perf/p2000-sparse-operations.sh +++ b/t/perf/p2000-sparse-operations.sh @@ -124,5 +124,6 @@ test_perf_on_all git read-tree -mu HEAD test_perf_on_all git checkout-index -f --all test_perf_on_all git update-index --add --remove $SPARSE_CONE/a test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a" +test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/*" test_done diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh index 63becc3138..fda05faadf 100755 --- a/t/t1092-sparse-checkout-compatibility.sh +++ b/t/t1092-sparse-checkout-compatibility.sh @@ -162,6 +162,19 @@ init_repos () { git -C sparse-index sparse-checkout set deep } +init_repos_as_submodules () { + git reset --hard && + init_repos && + git submodule add ./full-checkout && + git submodule add ./sparse-checkout && + git submodule add ./sparse-index && + + git submodule status >actual && + grep full-checkout actual && + grep sparse-checkout actual && + grep sparse-index actual +} + run_on_sparse () { ( cd sparse-checkout && @@ -1987,7 +2000,48 @@ test_expect_success 'grep is not expanded' ' # All files within the folder1/* pathspec are sparse, # so this command does not find any matches - ensure_not_expanded ! grep a -- folder1/* + ensure_not_expanded ! grep a -- folder1/* && + + # test out-of-cone pathspec with or without wildcard + ensure_not_expanded grep --sparse --cached a -- "folder1/a" && + ensure_not_expanded grep --sparse --cached a -- "folder1/*" && + + # test in-cone pathspec with or without wildcard + ensure_not_expanded grep --sparse --cached a -- "deep/a" && + ensure_not_expanded grep --sparse --cached a -- "deep/*" +' + +# NEEDSWORK: when running `grep` in the superproject with --recurse-submodules, +# Git expands the index of the submodules unexpectedly. Even though `grep` +# builtin is marked as "command_requires_full_index = 0", this config is only +# useful for the superproject. Namely, the submodules have their own configs, +# which are _not_ populated by the one-time sparse-index feature switch. +test_expect_failure 'grep within submodules is not expanded' ' + init_repos_as_submodules && + + # do not use ensure_not_expanded() here, becasue `grep` should be + # run in the superproject, not in "./sparse-index" + GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \ + git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" && + test_region ! index ensure_full_index trace2.txt +' + +# NEEDSWORK: this test is not actually testing the code. The design purpose +# of this test is to verify the grep result when the submodules are using a +# sparse-index. Namely, we want "folder1/" as a tree (a sparse directory); but +# because of the index expansion, we are now grepping the "folder1/a" blob. +# Because of the problem stated above 'grep within submodules is not expanded', +# we don't have the ideal test environment yet. +test_expect_success 'grep sparse directory within submodules' ' + init_repos_as_submodules && + + cat >expect <<-\EOF && + full-checkout/folder1/a:a + sparse-checkout/folder1/a:a + sparse-index/folder1/a:a + EOF + git grep --sparse --cached --recurse-submodules a -- "*/folder1/*" >actual && + test_cmp actual expect ' test_done