From patchwork Fri Jun 28 12:43:21 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13716056 Received: from mail-wm1-f50.google.com (mail-wm1-f50.google.com [209.85.128.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A3FEA15572B for ; Fri, 28 Jun 2024 12:43:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.50 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578613; cv=none; b=bEQJtqa876MIZnQzTIyz25eKxJGSnudc9/MrCqWcQPFp0x/gioGx7LX74EL2/gWC9pktB23bf4nIU+f3dHhYTJXbxegWvTlr76Z1sp9UwJcZiM+7KioAOy/65BrRHGL4yjRoHDDDo2Lgw/HwnMLAtBNtBw2cd42wntMMqSZ7xBk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578613; c=relaxed/simple; bh=1+YfJ3KeIDNMytgJhgLc/jzlPo48qZI+007fZd8EAuE=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=XSCdILB5SbHLsnEJ3x+hIy9Hat8Us5BycujKzUPUOB6hjfYtJ7ma+ogywm6F2Epp4Z61b9XNYeaFKkF2JJyx+cbtvoLbGvlXxJxowEEO8XMfWOEgNksDk5dMsvHbp5KLtavk0Wn6hAaMPmixcMYDe6AaMAyPINadrAepYUwEkz8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=JuKU5m4z; arc=none smtp.client-ip=209.85.128.50 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="JuKU5m4z" Received: by mail-wm1-f50.google.com with SMTP id 5b1f17b1804b1-42565cdf99cso5761635e9.3 for ; Fri, 28 Jun 2024 05:43:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719578608; x=1720183408; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=CR8mOzRt4TH4V8VXf39UAZ4aSlUBCBjfZL+B2dW2tgw=; b=JuKU5m4zVsSRIE1tMFe5pN/00uTYtUk9leakxXrIxvaJp4Ms37Vl3KWgsuDdIYK7TF fGI8r8tjFGWNMlSbsscxxqedQ5SCgqNeL+XqZkD8GsVuW6Xe1wPedHlVul7+kpjTg7x4 JOt02IAwtb7Az9RqpXZB3KCZD9zuZGDZMhXiAPVlrL2S10A4nNMnxB89KWcboxIxLbKk dKhZ4cfamuUWkVlWqccUUyPCYKq7IvK64UqoU0/CQ94CidU1fYUWQXsiInoWpE+6QTSJ 3kOCZo4KZ/HZPftTDPVMcuAQzgIj1+bmPN+69bTsAMe+lgED+Y6mYkDrtD0yx6dERGSu kZ1w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719578608; x=1720183408; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=CR8mOzRt4TH4V8VXf39UAZ4aSlUBCBjfZL+B2dW2tgw=; b=BFmNsFskwwtrdPd98W+Z9LrtDk0YPgARmlGqRHp2PB5K3yR4DfE+GF9BuTDnBLEGXu sowcQ/uTQxzWrQ1zlRgn50nAyuOzqEVoQWI0weCsgQ1VEvhmBZro5yDPPVEQ5d7E21eX O16B9A7tvJGxoWc6yI+Rgh6p70HDpJT4SBB7eqQ8YS4Zj3/Dc5GAw7BUIvwGMN2aADu/ atuYrRTLzvG+xOFpn655KNapoCiMysWqO/rQYxEAnAOQgqnjEyMv5yzt21tmk/KqbBHG zu3J0z2DXxT9PfGCaGmTVKQSyGW5pE8RkpgWJOWGp3kREPZrKiJEaea7P+IopFGky/Gm m8rQ== X-Gm-Message-State: AOJu0YxT0FTlOCcwEwMjz2evvOwsn/0f+PLTAmVpw4ghH51i6iaTawVd WvVSk3ToLApK1qENcH47h9nuwhXHwaAn4Vc00gc0ZxV81eEQKYyfPtas2w== X-Google-Smtp-Source: AGHT+IGUW21Yq3dC1x9y4W9MP9Z3EB2E/x4ghEmCgiBnzOVYBJ2GWrPZg4761E5+oBrO9UevwGZSyw== X-Received: by 2002:a1c:7716:0:b0:425:64c5:5780 with SMTP id 5b1f17b1804b1-42564c55995mr44681205e9.1.1719578608268; Fri, 28 Jun 2024 05:43:28 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4256b065316sm33392575e9.26.2024.06.28.05.43.27 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 28 Jun 2024 05:43:27 -0700 (PDT) Message-Id: <0844cda94cffd76ea9e86a0837731cfb4dc7bf88.1719578605.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Fri, 28 Jun 2024 12:43:21 +0000 Subject: [PATCH v3 1/5] sparse-checkout: refactor skip worktree retry logic Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree_from_present_files() method was introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to help cases where sparse-checkout is enabled but some paths outside of the sparse-checkout also exist on disk. This operation can be slow as it needs to check path existence in a way not stored in the index, so caching was introduced in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14). This check is particularly confusing in the presence of a sparse index, as a sparse tree entry corresponding to an existing directory must first be expanded to a full index before examining the paths within. This is currently implemented using a 'goto' and a boolean variable to ensure we restart only once. Even with that caching, it was noticed that this could take a long time to execute. 89aaab11a3 (index: add trace2 region for clear skip worktree, 2022-11-03) introduced trace2 regions to measure this time. Further, the way the loop repeats itself was slightly confusing and prone to breakage, so a BUG() statement was added in 8c7abdc596 (index: raise a bug if the index is materialised more than once, 2022-11-03) to be sure that the second run of the loop does not hit any sparse trees. One thing that can be confusing about the current setup is that the trace2 regions nest and it is not clear that a second loop is running after a sparse index is expanded. Here is an example of what the regions look like in a typical case: | region_enter | ... | label:clear_skip_worktree_from_present_files | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_enter | ... | ..label:ensure_full_index | region_enter | ... | ....label:update | region_leave | ... | ....label:update | region_leave | ... | ..label:ensure_full_index | data | ... | ..sparse_path_count:1 | data | ... | ..sparse_path_count_full:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files One thing that is particularly difficult to understand about these regions is that most of the time is spent between the close of the ensure_full_index region and the reporting of the end data. This is because of the restart of the loop being within the same region as the first iteration of the loop. This change refactors the method into two separate methods that are traced separately. This will be more important later when we change other features of the methods, but for now the only functional change is the difference in the structure of the trace regions. After this change, the same telemetry section is split into three distinct chunks: | region_enter | ... | label:clear_skip_worktree_from_present_files_sparse | data | ... | ..sparse_path_count:1 | region_leave | ... | label:clear_skip_worktree_from_present_files_sparse | region_enter | ... | label:update | region_leave | ... | label:update | region_enter | ... | label:ensure_full_index | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_leave | ... | label:ensure_full_index | region_enter | ... | label:clear_skip_worktree_from_present_files_full | data | ... | ..full_path_count:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files_full Here, we see the sparse loop terminating early with its first sparse path being a sparse directory containing a file. Then, that loop's region terminates before ensure_full_index begins (in this case, the cache-tree must also be computed). Then, _after_ the index is expanded, the full loop begins with its own region. Signed-off-by: Derrick Stolee --- sparse-index.c | 77 ++++++++++++++++++++++++++++++++++---------------- 1 file changed, 53 insertions(+), 24 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index e48e40cae71..e0457c87fff 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -486,49 +486,78 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, return 0; } -void clear_skip_worktree_from_present_files(struct index_state *istate) +static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate) { const char *last_dirname = NULL; size_t dir_len = 0; int dir_found = 1; - int i; - int path_count[2] = {0, 0}; - int restarted = 0; + int path_count = 0; + int to_restart = 0; - if (!core_apply_sparse_checkout || - sparse_expect_files_outside_of_patterns) - return; - - trace2_region_enter("index", "clear_skip_worktree_from_present_files", + trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); -restart: - for (i = 0; i < istate->cache_nr; i++) { + for (int i = 0; i < istate->cache_nr; i++) { struct cache_entry *ce = istate->cache[i]; if (ce_skip_worktree(ce)) { - path_count[restarted]++; + path_count++; if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { if (S_ISSPARSEDIR(ce->ce_mode)) { - if (restarted) - BUG("ensure-full-index did not fully flatten?"); - ensure_full_index(istate); - restarted = 1; - goto restart; + to_restart = 1; + break; } ce->ce_flags &= ~CE_SKIP_WORKTREE; } } } - if (path_count[0]) - trace2_data_intmax("index", istate->repo, - "sparse_path_count", path_count[0]); - if (restarted) - trace2_data_intmax("index", istate->repo, - "sparse_path_count_full", path_count[1]); - trace2_region_leave("index", "clear_skip_worktree_from_present_files", + trace2_data_intmax("index", istate->repo, + "sparse_path_count", path_count); + trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", + istate->repo); + return to_restart; +} + +static void clear_skip_worktree_from_present_files_full(struct index_state *istate) +{ + const char *last_dirname = NULL; + size_t dir_len = 0; + int dir_found = 1; + + int path_count = 0; + + trace2_region_enter("index", "clear_skip_worktree_from_present_files_full", istate->repo); + for (int i = 0; i < istate->cache_nr; i++) { + struct cache_entry *ce = istate->cache[i]; + + if (S_ISSPARSEDIR(ce->ce_mode)) + BUG("ensure-full-index did not fully flatten?"); + + if (ce_skip_worktree(ce)) { + path_count++; + if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) + ce->ce_flags &= ~CE_SKIP_WORKTREE; + } + } + + trace2_data_intmax("index", istate->repo, + "full_path_count", path_count); + trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", + istate->repo); +} + +void clear_skip_worktree_from_present_files(struct index_state *istate) +{ + if (!core_apply_sparse_checkout || + sparse_expect_files_outside_of_patterns) + return; + + if (clear_skip_worktree_from_present_files_sparse(istate)) { + ensure_full_index(istate); + clear_skip_worktree_from_present_files_full(istate); + } } /* From patchwork Fri Jun 28 12:43:22 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13716057 Received: from mail-wm1-f53.google.com (mail-wm1-f53.google.com [209.85.128.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 362E9158868 for ; Fri, 28 Jun 2024 12:43:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.53 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578613; cv=none; b=rnl5ev9UK8EI03rEf6ly7Ny0Cmpqe6gThD2PrfsT8UsWijfbOST5DlYVp6aPnXy1vSpwoD8GmsLsC0Gq3MPZhSOgejkO1uDtdBut5EBoC9tKRRCEDh3zAlXms36Xw6H15JmTnh3L8smJiWZQcF6h8CIQvxV8JiGlErg/UKtvTQU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578613; c=relaxed/simple; bh=Qz+nDsLZj+c5Zu02rUyu+EGfNUaXxkn9Om0dikyFjRQ=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=MYdCMd1+aTN+OJyRLVfUuncRv8nRfTqHtEQjt3RTQBLPtIekzz+Tl6yaeIXVjmfxnQgUaLTWvh0OZ57zomQ98zoiSKcEwJY/eAVN6l2JzxA/Jy56xwCxR+1EwRzUBz7rMy4Ebcdjmbu+mPdp60Fhh/ZyeNX7yWF/4U8f/1605yQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=ccL0w+V9; arc=none smtp.client-ip=209.85.128.53 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ccL0w+V9" Received: by mail-wm1-f53.google.com with SMTP id 5b1f17b1804b1-42562a984d3so4474985e9.3 for ; Fri, 28 Jun 2024 05:43:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719578609; x=1720183409; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=rEaRIqkrn/nVU9/+27kkujxPuiy8ceVVFNZtvynEzTw=; b=ccL0w+V9f8EjwmndZH4LziMmSLczdQs5PfvxHMgS20WF/L468VE8EOFd2WjA7m3+du 5p/DLDNQVJemg6HlgyrCUSdz1Ea2wYgQtIUnCG6HIXlBIumNTm/fFS8LCFoZzl7+/VLf XupUCez3Su/QhpJDBeT8VAUdIV71I2eK/Nt4Au8wLJvzsZXP2e8oBg0B5I8LAAWq+E9s m3tFThtbkBvASlKHF3kWynTwIOBkyP6+T8GYsf95IihFdY7QIfH0JF0+gBOSyo/q5cuZ tVbBC6DlOZD9Vx0EHWCI/bGvbSRTU+fM5jxFpXXoWTM8m5WzEuRtuu7BfT1Zx1832zxd a0Fw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719578609; x=1720183409; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=rEaRIqkrn/nVU9/+27kkujxPuiy8ceVVFNZtvynEzTw=; b=raflh2MdaG/03jB4w0C8PDymX2vrxuTiMFYzOwxgcxeNQagf/JEG48xld8GP9I3oOD 9I3ymHdZKGIO9v1OAsrE7PQdRVgkol7ZuypkdyyQVp7kFTfrVXiJkawn2BJYk9Gz9EuX yQLGQRl19IEZJS+Q6Otf68P8DvcMmlbgvd2iGgv457YprzMPqp84yXYTsRw+8W5OWA2w lDWrUZ/OSsJ0y56FE0ruy1z4aJFvwncC41IV9mp9waKH30b1HWQvU8vsFaBH+BVhwvA7 2rlOxOtjXGOwsdWm6jJhIt54e+Oo0n7tFBX7MslO75RsX16Pvt5WpE+fi6T5+76tYBV2 IySQ== X-Gm-Message-State: AOJu0YzqUfWtViaONXCJ9BXpW8TS4Psews7MJYR5pj746tn2a48YzGp5 QVEap8A8JFH2JZoFkjHDm6+Ej62nKMlmU+Du9eVZzOVlXfDNmK2ZKPlsPw== X-Google-Smtp-Source: AGHT+IHMiXcoh+rYa7tzgL5ietdZ0apECsDjT0qb7QJvdE8j607KDNHtAxNOxuxwfX6sZXMYUL8oPg== X-Received: by 2002:a05:600c:1c29:b0:424:a5df:b998 with SMTP id 5b1f17b1804b1-424a5dfba38mr65167665e9.9.1719578609233; Fri, 28 Jun 2024 05:43:29 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4256b068e93sm33952985e9.24.2024.06.28.05.43.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 28 Jun 2024 05:43:28 -0700 (PDT) Message-Id: In-Reply-To: References: Date: Fri, 28 Jun 2024 12:43:22 +0000 Subject: [PATCH v3 2/5] sparse-index: refactor path_found() Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee In advance of changing the behavior of path_found(), take all of the intermediate data values and group them into a single struct. This simplifies the method prototype as well as the initialization. Future changes can be made directly to the struct and method without changing the callers with this approach. Note that the clear_path_found_data() method is currently empty, as there is nothing to free. This method is a placeholder for future changes that require a non-trivial implementation. Its stub is created now so consumers could call it now and not change in future changes. Signed-off-by: Derrick Stolee --- sparse-index.c | 45 +++++++++++++++++++++++++++++---------------- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index e0457c87fff..de6e727f5c1 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -439,8 +439,22 @@ void ensure_correct_sparsity(struct index_state *istate) ensure_full_index(istate); } -static int path_found(const char *path, const char **dirname, size_t *dir_len, - int *dir_found) +struct path_found_data { + const char *dirname; + size_t dir_len; + int dir_found; +}; + +#define PATH_FOUND_DATA_INIT { \ + .dir_found = 1 \ +} + +static void clear_path_found_data(struct path_found_data *data) +{ + return; +} + +static int path_found(const char *path, struct path_found_data *data) { struct stat st; char *newdir; @@ -450,7 +464,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, * If dirname corresponds to a directory that doesn't exist, and this * path starts with dirname, then path can't exist. */ - if (!*dir_found && !memcmp(path, *dirname, *dir_len)) + if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len)) return 0; /* @@ -472,15 +486,16 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, * If path starts with directory (which we already lstat'ed and found), * then no need to lstat parent directory again. */ - if (*dir_found && *dirname && memcmp(path, *dirname, *dir_len)) + if (data->dir_found && data->dirname && + memcmp(path, data->dirname, data->dir_len)) return 0; /* Free previous dirname, and cache path's dirname */ - *dirname = path; - *dir_len = newdir - path + 1; + data->dirname = path; + data->dir_len = newdir - path + 1; - tmp = xstrndup(path, *dir_len); - *dir_found = !lstat(tmp, &st); + tmp = xstrndup(path, data->dir_len); + data->dir_found = !lstat(tmp, &st); free(tmp); return 0; @@ -488,9 +503,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate) { - const char *last_dirname = NULL; - size_t dir_len = 0; - int dir_found = 1; + struct path_found_data data = PATH_FOUND_DATA_INIT; int path_count = 0; int to_restart = 0; @@ -502,7 +515,7 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist if (ce_skip_worktree(ce)) { path_count++; - if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { + if (path_found(ce->name, &data)) { if (S_ISSPARSEDIR(ce->ce_mode)) { to_restart = 1; break; @@ -516,14 +529,13 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist "sparse_path_count", path_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); + clear_path_found_data(&data); return to_restart; } static void clear_skip_worktree_from_present_files_full(struct index_state *istate) { - const char *last_dirname = NULL; - size_t dir_len = 0; - int dir_found = 1; + struct path_found_data data = PATH_FOUND_DATA_INIT; int path_count = 0; @@ -537,7 +549,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista if (ce_skip_worktree(ce)) { path_count++; - if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) + if (path_found(ce->name, &data)) ce->ce_flags &= ~CE_SKIP_WORKTREE; } } @@ -546,6 +558,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista "full_path_count", path_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", istate->repo); + clear_path_found_data(&data); } void clear_skip_worktree_from_present_files(struct index_state *istate) From patchwork Fri Jun 28 12:43:23 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13716058 Received: from mail-wr1-f42.google.com (mail-wr1-f42.google.com [209.85.221.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1148B158879 for ; Fri, 28 Jun 2024 12:43:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.221.42 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578614; cv=none; b=YCk7Ia1d6zD4zEpQ2niKsfnTqu38mUmQcUh3Wy1aV4uOgpbsIxSUYFJWW9Yp6pQm1jfskcp19462LQPRRzd8qCTvwk2f16m9NkW41CViXrd5ZibPOTYc41g0yHgjeGovicKJLJab+O/eaClX1fedWZy6RyCIQEQeARsfTCp170M= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578614; c=relaxed/simple; bh=mKI90b/YGFpefVFGNK/0fVqAD9pGjE9PJw08JNb6Gps=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=PIRW7csIuWLaV3paesnsJwdbpuYK5qtYaJDvQpnsaRQx1cCKH4KaLwDSia+XVZ5p4LdAL7knK093G/l/cfPEDpWY1ePd597IyQXbDXpP9aXkVxYqrSu5AouxFvTSldZAV6S432TulcAG2SKSlPCZ3Edws3nzJSJH7ZOhZzNN/6A= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=C2Zq0BhT; arc=none smtp.client-ip=209.85.221.42 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="C2Zq0BhT" Received: by mail-wr1-f42.google.com with SMTP id ffacd0b85a97d-361785bfa71so483956f8f.2 for ; Fri, 28 Jun 2024 05:43:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719578611; x=1720183411; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=xO1bF8/D+K6VmcsbNZerxp108u1S6w5egvhSyaxchUI=; b=C2Zq0BhTB25Ri4tkE8nEOarFGx+naDRW6vGRckmtftXxuSKBCSiMgbExsbr591RzXv Sqw8UUcTj0I3OEK22j/9x3tvp/UoybgJSUZgourXkOWAY2ebjbczopimuXOg4wkEul0V LdGuh0lszagqpAF+smJsyX+ZNm/kmAUyBHs9IeSLqjXqfAfCsaP/vvLVOlSFomI+c1+g QfA4qEb8MCInvna9HskbeSTRXqLfh/dtBSofldBm83HnwHDrYGsTwxPndMuba+E5ZdGx ZWmlhVWUeR9TzEey1pPC9ghm1a+B+g3bbC8aKICjFw3XkL21HxFKE1LQbLiwd8uoEaCM EIbQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719578611; x=1720183411; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=xO1bF8/D+K6VmcsbNZerxp108u1S6w5egvhSyaxchUI=; b=F/qGYrH18GqdLb896rWZoWDVQvkdbrnojzxclKmkqr4U7+izWx8sCGIkzkrPvOY+C4 ojFUCVPlIIC8ShAyeQbQUexO1LtPgrGbuVFgOdZH0Zhva8mQatA9gLuIC00xfjWSafZ8 6htsY045/PayI4jGA/1vgk4iZKF/wBncg3qF/6zSqw9fMTd5e8vO+d0VzUjWXk8tRZ/w R48agswuTBt+jRR6OiC9XSLai7MiHIjh5Z7FQkmcHQtIuEPx38nw7jCL/gi/LRGK7Ezh IN/QmdMOJf+mYVg6v9FNW7BTNrDZk5jMC2nzROks9lyK9Z2kKQnBLO6d1WXPuuFUuDXs 3Jfw== X-Gm-Message-State: AOJu0YyMlyTkv3EBJ8zaYT7g81Med74jAfddvzVbKMYlITtRiJ1I0qEU al3y3Df0n829QlhHmoFJrrrXRb85FPMb7r7T/hLjVaFeyPBI9IUHonnnCw== X-Google-Smtp-Source: AGHT+IGe1nOyykNhj+8nD266y61XJTEXdE8O/bd+fWFIkDSh/LEIwGD7bidiPjKsPVdsNTKrgF2TJg== X-Received: by 2002:a5d:4982:0:b0:35f:d70:6193 with SMTP id ffacd0b85a97d-366e7a1111bmr10759823f8f.41.1719578611125; Fri, 28 Jun 2024 05:43:31 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-3675a0cd66fsm2271224f8f.8.2024.06.28.05.43.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 28 Jun 2024 05:43:29 -0700 (PDT) Message-Id: In-Reply-To: References: Date: Fri, 28 Jun 2024 12:43:23 +0000 Subject: [PATCH v3 3/5] sparse-index: use strbuf in path_found() Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The path_found() method previously reused strings from the cache entries the calling methods were using. This prevents string manipulation in place and causes some odd reallocation before the final lstat() call in the method. Refactor the method to use strbufs and copy the path into the strbuf, but also only the parent directory and not the whole path. This looks like extra copying when assigning the path to the strbuf, but we save an allocation by dropping the 'tmp' string, and we are "reusing" the copy from 'tmp' to put the data in the strbuf. Signed-off-by: Derrick Stolee --- sparse-index.c | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index de6e727f5c1..fec4f393360 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -440,31 +440,30 @@ void ensure_correct_sparsity(struct index_state *istate) } struct path_found_data { - const char *dirname; - size_t dir_len; + struct strbuf dir; int dir_found; }; #define PATH_FOUND_DATA_INIT { \ + .dir = STRBUF_INIT, \ .dir_found = 1 \ } static void clear_path_found_data(struct path_found_data *data) { - return; + strbuf_release(&data->dir); } static int path_found(const char *path, struct path_found_data *data) { struct stat st; char *newdir; - char *tmp; /* * If dirname corresponds to a directory that doesn't exist, and this * path starts with dirname, then path can't exist. */ - if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len)) + if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len)) return 0; /* @@ -486,17 +485,15 @@ static int path_found(const char *path, struct path_found_data *data) * If path starts with directory (which we already lstat'ed and found), * then no need to lstat parent directory again. */ - if (data->dir_found && data->dirname && - memcmp(path, data->dirname, data->dir_len)) + if (data->dir_found && data->dir.buf && + memcmp(path, data->dir.buf, data->dir.len)) return 0; /* Free previous dirname, and cache path's dirname */ - data->dirname = path; - data->dir_len = newdir - path + 1; + strbuf_reset(&data->dir); + strbuf_add(&data->dir, path, newdir - path + 1); - tmp = xstrndup(path, data->dir_len); - data->dir_found = !lstat(tmp, &st); - free(tmp); + data->dir_found = !lstat(data->dir.buf, &st); return 0; } From patchwork Fri Jun 28 12:43:24 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13716059 Received: from mail-wr1-f47.google.com (mail-wr1-f47.google.com [209.85.221.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 40CD9158D70 for ; Fri, 28 Jun 2024 12:43:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.221.47 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578616; cv=none; b=ZXmprMSXqUUSU/5x90rXkFuWWxli9NmUDfqcxsJHHoDmkO1fTgYRKQDPs7qPjq9hizDWRlxVQ0aw3qHUNsiYpYGwKAvZQFzadUxkFIrNQg7h3ooiX2y57673IcRuVOF7dItFlxvVR+1d3hRckNEzDYfiB46i3zeCuwIzWxEyBO4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578616; c=relaxed/simple; bh=v7pEIAOQacxrKcSmfYu3MMPjIoqKiKMlpNo3w47ds7I=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=ToUH0/LTVPtlbwMRGOmupWTe+goPxMKiuzmtA5SX6pogFwe06/BSn8wwDmOvcKgEIB+hM1fUlV2EFEXFEiOWUrEKdIh5Vnn5xl6kLiR2vG06Mg852/+QZZQARjh0BrF/HGcasH1xwymXE+8H1d/u3gADYccxzQmIPP0OUjLuBh0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=YUSUTwED; arc=none smtp.client-ip=209.85.221.47 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="YUSUTwED" Received: by mail-wr1-f47.google.com with SMTP id ffacd0b85a97d-363bd55bcc2so331120f8f.2 for ; Fri, 28 Jun 2024 05:43:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719578613; x=1720183413; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=KauGz/GdUXdGho5U7fUDm2FdMS4nlt8MegpX3RmXHxs=; b=YUSUTwEDZYyIgRrXpzqWO7Uo6sAx/TC/ruPj8XQIvz03W66aytqqL5fM+uLfoIOoNk 3DzxHSfo2/nu9dMbjMNQ7Y0KawwzWrjSc9ucLvpKUA2MwadkHhSp5QWS73ANovgmDkL8 KuNnodIrPtqUpzlkop7RYb2y0rUnFsESxavpaW1nkA+/g6W0BFgci8n0TTdazGhd5D5j sNn9VxA7W/ObwG9RxszUkxje5GIKcm1N4mJROrybJp/D1StzfJri6inKZwqAB5avMe3t Cqz0/psNwZRhB9d4vlHNU0TNVjZSgBf4M1HyINmPLxgebDCDhoW8c/ulSjDUFGPJvvPW iiLg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719578613; x=1720183413; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=KauGz/GdUXdGho5U7fUDm2FdMS4nlt8MegpX3RmXHxs=; b=G/zWwS9cKXUNxsWJs7t1IeoGeQNL/W8seW7jiKzK5Jd1oYjQQISF6qk/Sb6ctmQqg5 zPI3W10/y5TRLT3Y8VuDdt5FR0wtd8OJPe2quFqoRugpFnPWBpwgbpegY/LUNAV+gIFL VuPKGGrFXQqRY+nCgjt4HAs5h3NCh9LrdwQ7KjE71CP1sZProXBIunYMt26/uFUm2pHI 1/OmsOvChWQZPFZs3ThxIxumXuG3+IIpYjvOQSG+hb6zcFQvZGBX61V6/SnxjFJ6fE8R 5XcvZgeUjcTYJ21KcFXLS0/ICRM5Wy1qWIwT1WkHa0PJtQ7z5HUufGn+hR05odXWgz8y XfgQ== X-Gm-Message-State: AOJu0YzmOf7jFyiZV+PvPDldSUgfPoXG08ABGEzzQTCkpwP2jW7EWDaB 9HU4fz/Ku8R6uP6pOEh6kfDjUsJmPblIbX6Rr8YuHHisOqUEX3ieMqFn8g== X-Google-Smtp-Source: AGHT+IF+zwXZ4F3qsl8YgX7CJ4geFaW6gXd4kvUNerPew2OWZ9g85K53v1lDwOagq6Rvp3V11VGeCA== X-Received: by 2002:a05:6000:b44:b0:35f:11c5:5c74 with SMTP id ffacd0b85a97d-366e94cc54amr10031556f8f.36.1719578612859; Fri, 28 Jun 2024 05:43:32 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-3675a0fc5f4sm2216498f8f.75.2024.06.28.05.43.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 28 Jun 2024 05:43:32 -0700 (PDT) Message-Id: In-Reply-To: References: Date: Fri, 28 Jun 2024 12:43:24 +0000 Subject: [PATCH v3 4/5] sparse-index: count lstat() calls Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree.. methods already report some statistics about how many cache entries are checked against path_found() due to having the skip-worktree bit set. However, due to path_found() performing some caching, this isn't the only information that would be helpful to report. Add a new lstat_count member to the path_found_data struct to count the number of times path_found() calls lstat(). This will be helpful to help explain performance problems in this method as well as to demonstrate future changes to the caching algorithm in a more concrete way than end-to-end timings. Signed-off-by: Derrick Stolee --- sparse-index.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/sparse-index.c b/sparse-index.c index fec4f393360..8577fa726b8 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -442,6 +442,7 @@ void ensure_correct_sparsity(struct index_state *istate) struct path_found_data { struct strbuf dir; int dir_found; + size_t lstat_count; }; #define PATH_FOUND_DATA_INIT { \ @@ -469,6 +470,7 @@ static int path_found(const char *path, struct path_found_data *data) /* * If path itself exists, return 1. */ + data->lstat_count++; if (!lstat(path, &st)) return 1; @@ -493,6 +495,7 @@ static int path_found(const char *path, struct path_found_data *data) strbuf_reset(&data->dir); strbuf_add(&data->dir, path, newdir - path + 1); + data->lstat_count++; data->dir_found = !lstat(data->dir.buf, &st); return 0; @@ -524,6 +527,8 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist trace2_data_intmax("index", istate->repo, "sparse_path_count", path_count); + trace2_data_intmax("index", istate->repo, + "sparse_lstat_count", data.lstat_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); clear_path_found_data(&data); @@ -553,6 +558,8 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista trace2_data_intmax("index", istate->repo, "full_path_count", path_count); + trace2_data_intmax("index", istate->repo, + "full_lstat_count", data.lstat_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", istate->repo); clear_path_found_data(&data); From patchwork Fri Jun 28 12:43:25 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13716060 Received: from mail-wm1-f41.google.com (mail-wm1-f41.google.com [209.85.128.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E63E7158D7D for ; Fri, 28 Jun 2024 12:43:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.41 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578617; cv=none; b=AX87wwZ5u67duKJxCQrTdoilCe4mGGlak3eAV5ZhFS/+YzmpF7rFMAgeJIOTpngAqMJ8WTO0Wnoq2QGq26GhuH4KEAeKGb2LsmJsKchHQ7gWVZIyU9eCMeKF+ufJw6RjERp9uleMbZuOR14pjAm4FplS3VIXkjOh9lSr4KAUgqE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719578617; c=relaxed/simple; bh=GS7+3isRBYjlXCv2rqCVhrgzzOewamGPzXuJROXLsIk=; h=Message-Id:In-Reply-To:References:From:Date:Subject:MIME-Version: Content-Type:To:Cc; b=rCLC9e16+oQsgXsBeYvVFtMivHjoESCEML0fz21uwM8L0xRa4vkmHFkvBbDH6gL7VNrsq0Yc3ogu1wOU87IwYburS3QYqqqSPKM7JlF8ptBW4CSSaifWRt7oQmyMNHUm1YI1d2CnZ6Fl1vf4ikG6djAMmCiZK4tjO1ddTdVEzt4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=BeyiX5Y1; arc=none smtp.client-ip=209.85.128.41 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="BeyiX5Y1" Received: by mail-wm1-f41.google.com with SMTP id 5b1f17b1804b1-4256eec963eso3574935e9.1 for ; Fri, 28 Jun 2024 05:43:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719578614; x=1720183414; darn=vger.kernel.org; h=cc:to:fcc:content-transfer-encoding:mime-version:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=SERYMe5VcXolwDXjtBggm7ICzEQaVY7eP5VLboZJyZw=; b=BeyiX5Y1j64y701hYFaCjl2lpHPFmM35zI+/VqlXrSnfuk8sQPzfnfTWrjtJjYj9uW jVaGT/ufjr9YNq/uEsD1SYGjQ+NdRowG5iNS7b3maeqz+4kpX30HB4Dupu3GbuWDZmA+ 1/to4XTt2p6nXBYBHCZDuXjI2/A6vn63Jby7LPN3f4uUWoeCtPWXdeOHjOtTK9ui/Gw8 O6SzhILNNsVQ3M3j5mmQDmwMkPHGzfTNl3XaciQKEh4ZEzAcCnyHLuq27IfqvYkqJ4tu 6kb8aYY9E74qjfVuOPBNdZcY8RUS7P4GGW3SLKDOLsqKpmuooy+ivewefk3p4WR8hYjn IP4w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719578614; x=1720183414; h=cc:to:fcc:content-transfer-encoding:mime-version:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=SERYMe5VcXolwDXjtBggm7ICzEQaVY7eP5VLboZJyZw=; b=k4II+vOfIBTVbQJv+nAEBJAIU8JX5CfwNIGfJ6EREbJhWRmz7v1WMrw5WHW8MmgDUn SLzCXg1GbJWYsyiAUuPrFNW+JG4qZfW43zfXB5EQ49Dwi4Dmcs4FrgrPPEvOH/M0rCvz QW+QxJduWIXtnsQfuW9OnIfnN8iFB23qAzpu4txfG3uOjmKCY/ziJCIgTh6MR4R4kOu6 2/flS5Sq17RTqq7h2u8thBIpQ+u1fQDJsuscXwhOEz1JzwYhwPt9ivok4RJyqZwNs728 m+ZrItbPnHcSak4G7eX0wm3TRPxCg/KCrrCEhm8/b5YtTICu2AWGovo08e0Jxm2npovA YFUg== X-Gm-Message-State: AOJu0YwA8OhG5MnISDWR/1v8+G1i8HxnkyNZ5RFEOJzpKotiyZ2ttsLJ C/4PlZRJx8Ilnpv9E3OUuOQVzMa6EoPCVrHdx40kU5VYOT+hEgH4FijSdw== X-Google-Smtp-Source: AGHT+IH9qe4GWucmqq5XMLE/1XFzHJnoQaXhO95OBC7MnYhhohfsI+XUNWKLcRx3POcDG5MBrWbvCg== X-Received: by 2002:a05:600c:43d4:b0:425:6526:73f4 with SMTP id 5b1f17b1804b1-42565267473mr33338655e9.38.1719578613752; Fri, 28 Jun 2024 05:43:33 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-3675a0fbbd4sm2225462f8f.84.2024.06.28.05.43.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 28 Jun 2024 05:43:33 -0700 (PDT) Message-Id: <1f58e19691f2dd6ff525cc19c98158861d6fe411.1719578605.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Fri, 28 Jun 2024 12:43:25 +0000 Subject: [PATCH v3 5/5] sparse-index: improve lstat caching of sparse paths Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree_from_present_files() method was first introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to allow better interaction with the working directory in the presence of paths outside of the sparse-checkout. The initial implementation would lstat() every single SKIP_WORKTREE path to see if it existed; if it ran across a sparse directory that existed (when a sparse index was in use), then it would expand the index and then check every SKIP_WORKTREE path. Since these lstat() calls were very expensive, this was improved in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14) by caching directories that do not exist so it could avoid lstat()ing any files under such directories. However, there are some inefficiencies in that caching mechanism. The caching mechanism stored only the parent directory as not existing, even if a higher parent directory also does not exist. This means that wasted lstat() calls would occur when the paths passed to path_found() change immediate parent directories but within the same parent directory that does not exist. To create an example repository that demonstrates this problem, it helps to have a directory outside of the sparse-checkout that contains many deep paths. In particular, the first paths (in lexicographic order) underneath the sparse directory should have deep directory structures, maximizing the difference between the old caching algorithm that looks to a single parent and the new caching algorithm that looks to the top-most missing directory. The performance test script p2000-sparse-operations.sh takes the sample repository and copies its HEAD to several copies nested in directories of the form f/f/f where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior and also lead to some interesting cases for the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do not. This is difficult to notice when running performance tests using the Git repository (or a blow-up of the Git repository, as in p2000-sparse-operations.sh) because Git has a very shallow directory structure. This change reorganizes the caching algorithm to focus on storing the highest level leading directory that does not exist; specifically this means that that directory's parent _does_ exist. By doing a little extra work on a path passed to path_found(), we can short-circuit all of the paths passed to path_found() afterwards that match a prefix with that non-existing directory. When in a repository where the first sparse file is likely to have a much deeper path than the first non-existing directory, this can realize significant gains. The details of this algorithm require careful attention, so the new implementation of path_found() has detailed comments, including the use of a new max_common_dir_prefix() method that may be of independent interest. It's worth noting that this is not universally positive, since we are doing extra lstat() calls to establish the exact path to cache. In the blow-up of the Git repository, we can see that the lstat count _increases_ from 28 to 31. However, these numbers were already artificially low. Contributor Elijah Newren created a publicly-available test repository that demonstrates the difference in these caching algorithms in the most extreme way. To test, follow these steps: git clone --sparse https://github.com/newren/gvfs-like-git-bomb cd gvfs-like-git-bomb ./runme.sh # NOTE: check scripts before running! At this point, assuming you do not have index.sparse=true set globally, the index has one million paths with the SKIP_WORKTREE bit and they will all be sent to path_found() in the sparse loop. You can measure this by running 'git status' with GIT_TRACE2_PERF=1: Sparse files in the index: 1,000,000 sparse_lstat_count (before): 200,000 sparse_lstat_count (after): 2 And here are the performance numbers: Benchmark 1: old Time (mean ± σ): 397.5 ms ± 4.1 ms Range (min … max): 391.2 ms … 404.8 ms 10 runs Benchmark 2: new Time (mean ± σ): 252.7 ms ± 3.1 ms Range (min … max): 249.4 ms … 259.5 ms 11 runs Summary 'new' ran 1.57 ± 0.02 times faster than 'old' By modifying this example further, we can demonstrate a more realistic example and include the sparse index expansion. Continue by creating this directory, confusing both caching algorithms somewhat: mkdir -p bomb/d/e/f/a/a Then re-run the 'git status' tests to see these statistics: Sparse files in the index: 1,000,000 sparse_lstat_count (before): 724,010 sparse_lstat_count (after): 106 Benchmark 1: old Time (mean ± σ): 753.0 ms ± 3.5 ms Range (min … max): 749.7 ms … 760.9 ms 10 runs Benchmark 2: new Time (mean ± σ): 201.4 ms ± 3.2 ms Range (min … max): 196.0 ms … 207.9 ms 14 runs Summary 'new' ran 3.74 ± 0.06 times faster than 'old' Note that if this repository had a sparse index enabled, the additional cost of expanding the sparse index affects the total time of these commands by over four seconds, significantly diminishing the benefit of the caching algorithm. Having existing paths outside of the sparse-checkout is a known performance issue for the sparse index and is a known trade-off for the performance benefits given when no such paths exist. Using an internal monorepo with over two million paths at HEAD and a typical sparse-checkout cone such that the sparse index contains ~190,000 entries (including over two thousand sparse trees), I was able to measure these lstat counts when one sparse directory actually exists on disk: Sparse files in expanded index: 1,841,997 full_lstat_count (before): 1,188,161 full_lstat_count (after): 4,404 This resulted in this absolute time change, on a warm disk: Time in full loop (before): 13.481 s Time in full loop (after): 0.081 s (These times were calculated on a Windows machine, where lstat() is slower than a similar Linux machine.) Helped-by: Elijah Newren Signed-off-by: Derrick Stolee --- sparse-index.c | 114 ++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 90 insertions(+), 24 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index 8577fa726b8..9913a6078cd 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate) } struct path_found_data { + /** + * The path stored in 'dir', if non-empty, corresponds to the most- + * recent path that we checked where: + * + * 1. The path should be a directory, according to the index. + * 2. The path does not exist. + * 3. The parent path _does_ exist. (This may be the root of the + * working directory.) + */ struct strbuf dir; - int dir_found; size_t lstat_count; }; #define PATH_FOUND_DATA_INIT { \ - .dir = STRBUF_INIT, \ - .dir_found = 1 \ + .dir = STRBUF_INIT \ } static void clear_path_found_data(struct path_found_data *data) @@ -455,49 +462,108 @@ static void clear_path_found_data(struct path_found_data *data) strbuf_release(&data->dir); } +/** + * Return the length of the longest common substring that ends in a + * slash ('/') to indicate the longest common parent directory. Returns + * zero if no common directory exists. + */ +static size_t max_common_dir_prefix(const char *path1, const char *path2) +{ + size_t common_prefix = 0; + for (size_t i = 0; path1[i] && path2[i]; i++) { + if (path1[i] != path2[i]) + break; + + /* + * If they agree at a directory separator, then add one + * to make sure it is included in the common prefix string. + */ + if (path1[i] == '/') + common_prefix = i + 1; + } + + return common_prefix; +} + static int path_found(const char *path, struct path_found_data *data) { struct stat st; - char *newdir; + size_t common_prefix; /* - * If dirname corresponds to a directory that doesn't exist, and this - * path starts with dirname, then path can't exist. + * If data->dir is non-empty, then it contains a path that doesn't + * exist, including an ending slash ('/'). If it is a prefix of 'path', + * then we can return 0. */ - if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len)) + if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len)) return 0; /* - * If path itself exists, return 1. + * Otherwise, we must check if the current path exists. If it does, then + * return 1. The cached directory will be skipped until we come across + * a missing path again. */ data->lstat_count++; if (!lstat(path, &st)) return 1; /* - * Otherwise, path does not exist so we'll return 0...but we'll first - * determine some info about its parent directory so we can avoid - * lstat calls for future cache entries. + * At this point, we know that 'path' doesn't exist, and we know that + * the parent directory of 'data->dir' does exist. Let's set 'data->dir' + * to be the top-most non-existing directory of 'path'. If the first + * parent of 'path' exists, then we will act as though 'path' + * corresponds to a directory (by adding a slash). */ - newdir = strrchr(path, '/'); - if (!newdir) - return 0; /* Didn't find a parent dir; just return 0 now. */ + common_prefix = max_common_dir_prefix(path, data->dir.buf); /* - * If path starts with directory (which we already lstat'ed and found), - * then no need to lstat parent directory again. + * At this point, 'path' and 'data->dir' have a common existing parent + * directory given by path[0..common_prefix] (which could have length 0). + * We "grow" the data->dir buffer by checking for existing directories + * along 'path'. */ - if (data->dir_found && data->dir.buf && - memcmp(path, data->dir.buf, data->dir.len)) - return 0; - /* Free previous dirname, and cache path's dirname */ - strbuf_reset(&data->dir); - strbuf_add(&data->dir, path, newdir - path + 1); + strbuf_setlen(&data->dir, common_prefix); + while (1) { + /* Find the next directory in 'path'. */ + const char *rest = path + data->dir.len; + const char *next_slash = strchr(rest, '/'); - data->lstat_count++; - data->dir_found = !lstat(data->dir.buf, &st); + /* + * If there are no more slashes, then 'path' doesn't contain a + * non-existent _parent_ directory. Set 'data->dir' to be equal + * to 'path' plus an additional slash, so it can be used for + * caching in the future. The filename of 'path' is considered + * a non-existent directory. + * + * Note: if "{path}/" exists as a directory, then it will never + * appear as a prefix of other callers to this method, assuming + * the context from the clear_skip_worktree... methods. If this + * method is reused, then this must be reconsidered. + */ + if (!next_slash) { + strbuf_addstr(&data->dir, rest); + strbuf_addch(&data->dir, '/'); + break; + } + /* + * Now that we have a slash, let's grow 'data->dir' to include + * this slash, then test if we should stop. + */ + strbuf_add(&data->dir, rest, next_slash - rest + 1); + + /* If the parent dir doesn't exist, then stop here. */ + data->lstat_count++; + if (lstat(data->dir.buf, &st)) + return 0; + } + + /* + * At this point, 'data->dir' is equal to 'path' plus a slash character, + * and the parent directory of 'path' definitely exists. Moreover, we + * know that 'path' doesn't exist, or we would have returned 1 earlier. + */ return 0; }