From patchwork Wed Jun 26 14:29:47 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13713030 Received: from mail-lj1-f181.google.com (mail-lj1-f181.google.com [209.85.208.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ED0B4185E7E for ; Wed, 26 Jun 2024 14:29:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.181 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412198; cv=none; b=VnkWMpQ0jcvbUQg88R2EvBGletatQsJagKM/A4nAYIToKvCnUpADFB9lEIYdWVep5UHOA4VtZ6sbDwCem3RgnP7MOiRyvljZhfjbuq+d/UrRlgpe+it0r+4QIvotY9s310beB3gDm0EkPWMBXtFbUyHigGcAp5xfJxtAqdUQ+jM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412198; c=relaxed/simple; bh=TFfNkbYDVJxfPTrj8v6tFtPvUjpRt7Zyk9Mih8HHgXw=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=KILpeNiWRIQw+8ujhDUvYoxtuTGEdQc1Inh1d30I0gN8uuC6h9eljo6dvJDtwu170V2wiNmUPQctcEQGiMw2RV1ny5CYhvmgWC6Tey7SkzhrNSE0W6KZWpxmVkFnResEmDYqt02aBRW1/vesDnDYlr3HxpMctGwr1zR5dAeyH54= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=MABWh4ij; arc=none smtp.client-ip=209.85.208.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="MABWh4ij" Received: by mail-lj1-f181.google.com with SMTP id 38308e7fff4ca-2ec50a5e230so51052611fa.0 for ; Wed, 26 Jun 2024 07:29:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719412195; x=1720016995; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=icp2y/t/djF1Zsvl1v8rCgl2PpfJnBbqITG5R4+I9HU=; b=MABWh4ijO4OhoNDuFxlvgidYwHALZrqNMo40Ueuub8OQXek7GRtBv11HsIWRDEPRFc cS5SRtmjEWEKNp3IBJasgQsGK1CUgWtlUn6780ZKWjOOuETfZTNf66IfOoVKUsNJtoXF Uz7WSY1ptwFCUS70PNIQhSN6glsS0X1hQ32QLLcafnkURQK/Jb0XaBtSfi9cWvXouZ7g oX9KBa48cf1kUCQUJlHW9WDpKAimEYxESUkxSt02ANvBaaDyu1IJbA6RnZkAwL3AhEUy CsN1WOxGlmMChZJZJmKafBB0/E9wPgrGGqnuTk1Tz166vOjrCTIjmUHdpTqlYTiSW4HS o93A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719412195; x=1720016995; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=icp2y/t/djF1Zsvl1v8rCgl2PpfJnBbqITG5R4+I9HU=; b=r/lLed2at3Fsb2R1bN+W58jMgXeteiKVuzZz/oeke0yMX05h6hDDjCcUmOzAqPjcwO ytfP6cOgX9a0n4c+/Jj0nGCBpJAHHvueXHsdvCpkR/HW78nZDImlKzX/lD738wAQybpn nThHIsKwFICgOcff3pbzXFU03NdwUkUDknwJb1x4rc6lkVTABEz3TP0ci/qw+6obUaYZ 8GsVhtdxel/deyr4FnRU/cKNTOlkDGjH2FSJf34C9eRB4edg0Rk93NuN5aVMPVqkJN5N 5ntxhOARw64RW7Sir0eLR0OkaIOBfYG21SPc6xxV9lcfvFz3uVgXfxj9oMGSZT+GGSPU IJiA== X-Gm-Message-State: AOJu0YxKibd8irS6JjvrLL/rnj242qLs9Ou5GV7h/YQQSF5WRDdglts8 p4NanG86FU9IXkvTxC4LsWbWU6tMs6TliyaFyI+brXTmeSMtO+A0EmNygA== X-Google-Smtp-Source: AGHT+IGfPlQbZzvSdAohm4RX9WTvGbtoEnQkSTvegknAKmZkDMfOcXa6eJzIKiWDi5D5xmZTm6qJ4A== X-Received: by 2002:a19:4312:0:b0:52b:e800:25d8 with SMTP id 2adb3069b0e04-52ce183559fmr6251321e87.25.1719412194389; Wed, 26 Jun 2024 07:29:54 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-424c823c2c1sm28135885e9.7.2024.06.26.07.29.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Jun 2024 07:29:54 -0700 (PDT) Message-Id: <93d0baed0b0f435e5656cef04cf103b5e2e0f41a.1719412192.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Wed, 26 Jun 2024 14:29:47 +0000 Subject: [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree_from_present_files() method was introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to help cases where sparse-checkout is enabled but some paths outside of the sparse-checkout also exist on disk. This operation can be slow as it needs to check path existence in a way not stored in the index, so caching was introduced in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14). If users are having trouble with the performance of this operation and don't care about paths outside of the sparse-checkout, they can disable them using the sparse.expectFilesOutsideOfPatterns config option introduced in ecc7c8841d (repo_read_index: add config to expect files outside sparse patterns, 2022-02-25). This check is particularly confusing in the presence of a sparse index, as a sparse tree entry corresponding to an existing directory must first be expanded to a full index before examining the paths within. This is currently implemented using a 'goto' and a boolean variable to ensure we restart only once. Even with that caching, it was noticed that this could take a long time to execute. 89aaab11a3 (index: add trace2 region for clear skip worktree, 2022-11-03) introduced trace2 regions to measure this time. Further, the way the loop repeats itself was slightly confusing and prone to breakage, so a BUG() statement was added in 8c7abdc596 (index: raise a bug if the index is materialised more than once, 2022-11-03) to be sure that the second run of the loop does not hit any sparse trees. One thing that can be confusing about the current setup is that the trace2 regions nest and it is not clear that a second loop is running after a sparse index is expanded. Here is an example of what the regions look like in a typical case: | region_enter | ... | label:clear_skip_worktree_from_present_files | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_enter | ... | ..label:ensure_full_index | region_enter | ... | ....label:update | region_leave | ... | ....label:update | region_leave | ... | ..label:ensure_full_index | data | ... | ..sparse_path_count:1 | data | ... | ..sparse_path_count_full:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files One thing that is particularly difficult to understand about these regions is that most of the time is spent between the close of the ensure_full_index region and the reporting of the end data. This is because of the restart of the loop being within the same region as the first iteration of the loop. This change refactors the method into two separate methods that are traced separately. This will be more important later when we change other features of the methods, but for now the only functional change is the difference in the structure of the trace regions. After this change, the same telemetry section is split into three distinct chunks: | region_enter | ... | label:clear_skip_worktree_from_present_files_sparse | data | ... | ..sparse_path_count:1 | region_leave | ... | label:clear_skip_worktree_from_present_files_sparse | region_enter | ... | label:update | region_leave | ... | label:update | region_enter | ... | label:ensure_full_index | region_enter | ... | ..label:update | region_leave | ... | ..label:update | region_leave | ... | label:ensure_full_index | region_enter | ... | label:clear_skip_worktree_from_present_files_full | data | ... | ..full_path_count:269538 | region_leave | ... | label:clear_skip_worktree_from_present_files_full Here, we see the sparse loop terminating early with its first sparse path being a sparse directory containing a file. Then, that loop's region terminates before ensure_full_index begins (in this case, the cache-tree must also be computed). Then, _after_ the index is expanded, the full loop begins with its own region. Signed-off-by: Derrick Stolee --- sparse-index.c | 77 ++++++++++++++++++++++++++++++++++---------------- 1 file changed, 53 insertions(+), 24 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index e48e40cae71..e0457c87fff 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -486,49 +486,78 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, return 0; } -void clear_skip_worktree_from_present_files(struct index_state *istate) +static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate) { const char *last_dirname = NULL; size_t dir_len = 0; int dir_found = 1; - int i; - int path_count[2] = {0, 0}; - int restarted = 0; + int path_count = 0; + int to_restart = 0; - if (!core_apply_sparse_checkout || - sparse_expect_files_outside_of_patterns) - return; - - trace2_region_enter("index", "clear_skip_worktree_from_present_files", + trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); -restart: - for (i = 0; i < istate->cache_nr; i++) { + for (int i = 0; i < istate->cache_nr; i++) { struct cache_entry *ce = istate->cache[i]; if (ce_skip_worktree(ce)) { - path_count[restarted]++; + path_count++; if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { if (S_ISSPARSEDIR(ce->ce_mode)) { - if (restarted) - BUG("ensure-full-index did not fully flatten?"); - ensure_full_index(istate); - restarted = 1; - goto restart; + to_restart = 1; + break; } ce->ce_flags &= ~CE_SKIP_WORKTREE; } } } - if (path_count[0]) - trace2_data_intmax("index", istate->repo, - "sparse_path_count", path_count[0]); - if (restarted) - trace2_data_intmax("index", istate->repo, - "sparse_path_count_full", path_count[1]); - trace2_region_leave("index", "clear_skip_worktree_from_present_files", + trace2_data_intmax("index", istate->repo, + "sparse_path_count", path_count); + trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", + istate->repo); + return to_restart; +} + +static void clear_skip_worktree_from_present_files_full(struct index_state *istate) +{ + const char *last_dirname = NULL; + size_t dir_len = 0; + int dir_found = 1; + + int path_count = 0; + + trace2_region_enter("index", "clear_skip_worktree_from_present_files_full", istate->repo); + for (int i = 0; i < istate->cache_nr; i++) { + struct cache_entry *ce = istate->cache[i]; + + if (S_ISSPARSEDIR(ce->ce_mode)) + BUG("ensure-full-index did not fully flatten?"); + + if (ce_skip_worktree(ce)) { + path_count++; + if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) + ce->ce_flags &= ~CE_SKIP_WORKTREE; + } + } + + trace2_data_intmax("index", istate->repo, + "full_path_count", path_count); + trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", + istate->repo); +} + +void clear_skip_worktree_from_present_files(struct index_state *istate) +{ + if (!core_apply_sparse_checkout || + sparse_expect_files_outside_of_patterns) + return; + + if (clear_skip_worktree_from_present_files_sparse(istate)) { + ensure_full_index(istate); + clear_skip_worktree_from_present_files_full(istate); + } } /* From patchwork Wed Jun 26 14:29:48 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13713031 Received: from mail-lj1-f169.google.com (mail-lj1-f169.google.com [209.85.208.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A442945C07 for ; Wed, 26 Jun 2024 14:29:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412199; cv=none; b=EE+IlNndus32p4L1A6O51elOx2xVAmCR4I0SIF/HgeqwFZ6rLvxs/BhdkyLuPIgmB1LHGOsN2tvbZdxvXGBKQkt+R/3dv11oO+iJdllqQBTK7gHg5b8xcNmwkD4cT8/722rpfgVNZGuq1AMl96DXe0B/7jpuTLFAzUgeUd26hAc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412199; c=relaxed/simple; bh=Qz+nDsLZj+c5Zu02rUyu+EGfNUaXxkn9Om0dikyFjRQ=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=PbDoS8EBR0LJRfluhUng6htukODlZwUmYMlTDGxyr9TjW6ko9H6fwUEgk86SgwSvk9gHvIDN86wfmWR/RHdTKpMJxFwK0MSJYNFZd20G32eXpiUAD9a7blfoNwv4o7zjZEhhHDS+E6MMFIGQc/WbnS6XsB4D87x8Rv0Tv2Kvin8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=na2CWu5b; arc=none smtp.client-ip=209.85.208.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="na2CWu5b" Received: by mail-lj1-f169.google.com with SMTP id 38308e7fff4ca-2ec10324791so81956291fa.1 for ; Wed, 26 Jun 2024 07:29:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719412195; x=1720016995; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=rEaRIqkrn/nVU9/+27kkujxPuiy8ceVVFNZtvynEzTw=; b=na2CWu5bTXifeGzjjoKmXB5PBaKpunSsynNRsX4vvJdJJ01VXDDeuIMOt4F6jGF5UT Yj9t7y5nuNcOQKWoLXpXzC3WUR3BmhwhmPtute4HFlq5sDG7V/qjg8igidk/H6qvF32M 9u87me5SnI+6zVHs+3UGRLbEUorgaMEkPLUtI9TdRKTnm1NOpjZRkRwkHwXvzD98JYNB sIwup11c1ftjNundLuLVt8Af6SbGci/U0QZA8FI+1LeXvxKAkKk3dVd7UlV9ug+KQUZL Bc1kwUUZyE8CxjhYMRjqcD+psdqNwiW6hYRquxoN4jtJYOJ4dn5Ns2XYFVa2+iAinpoZ uYTA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719412195; x=1720016995; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=rEaRIqkrn/nVU9/+27kkujxPuiy8ceVVFNZtvynEzTw=; b=a2JaeWlMsFCcJ1xioy6NNmIrTxX7VTC9LXSta/a0MSah39YX5pd5Di4aBTsN13USHe OQRmpOVhlYQgo71cZZN2F8kA9rxOplHr5B0krTywbOo72C6nWlp949sQnxVB6NaFmy1B 8dW/ngrBQcrfV+dFDI1f2Z0p6R7VZV8De+lhxSiEq5BkMD+wyWQqr9MU2AB4kShZet4L PE5OVIXWPZWA5EI0BOdWHkKhDFxV/UfQE7UEw6clm61IkVj3G/gBc1l2g9ieWaXmyX9M m1ZhnLzcEZquH7Jfm4MRItTx2+PPlieWB+OwTP+XIij9ja1vzVcUChVItT6SttG9oXVD Eu0w== X-Gm-Message-State: AOJu0YzeH4p3Vwi0aAxBCdHSYeE5eh+NXN953eUDsgRa5zRkMM7BQRyD 5EkPKdHnGzHnPaHmDdV1Pkc/42n7MqhVm5xg6fhA0M2hDxQ7bZ6CI19j+A== X-Google-Smtp-Source: AGHT+IHDW2A2mQ1ExPUKIgbYGplSk6mViBVUbT6k30odTm+lgudzPRa43b+D9F9MDAfw8hk8UQeLUw== X-Received: by 2002:a2e:9f46:0:b0:2eb:d924:43fb with SMTP id 38308e7fff4ca-2ec5b2dd3ebmr61423571fa.41.1719412195192; Wed, 26 Jun 2024 07:29:55 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-366f624cdacsm8038206f8f.70.2024.06.26.07.29.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Jun 2024 07:29:54 -0700 (PDT) Message-Id: <69c3beaabf7ca8ccf5bbaa248328f77b1be915ba.1719412192.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Wed, 26 Jun 2024 14:29:48 +0000 Subject: [PATCH v2 2/5] sparse-index: refactor path_found() Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee In advance of changing the behavior of path_found(), take all of the intermediate data values and group them into a single struct. This simplifies the method prototype as well as the initialization. Future changes can be made directly to the struct and method without changing the callers with this approach. Note that the clear_path_found_data() method is currently empty, as there is nothing to free. This method is a placeholder for future changes that require a non-trivial implementation. Its stub is created now so consumers could call it now and not change in future changes. Signed-off-by: Derrick Stolee --- sparse-index.c | 45 +++++++++++++++++++++++++++++---------------- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index e0457c87fff..de6e727f5c1 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -439,8 +439,22 @@ void ensure_correct_sparsity(struct index_state *istate) ensure_full_index(istate); } -static int path_found(const char *path, const char **dirname, size_t *dir_len, - int *dir_found) +struct path_found_data { + const char *dirname; + size_t dir_len; + int dir_found; +}; + +#define PATH_FOUND_DATA_INIT { \ + .dir_found = 1 \ +} + +static void clear_path_found_data(struct path_found_data *data) +{ + return; +} + +static int path_found(const char *path, struct path_found_data *data) { struct stat st; char *newdir; @@ -450,7 +464,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, * If dirname corresponds to a directory that doesn't exist, and this * path starts with dirname, then path can't exist. */ - if (!*dir_found && !memcmp(path, *dirname, *dir_len)) + if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len)) return 0; /* @@ -472,15 +486,16 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, * If path starts with directory (which we already lstat'ed and found), * then no need to lstat parent directory again. */ - if (*dir_found && *dirname && memcmp(path, *dirname, *dir_len)) + if (data->dir_found && data->dirname && + memcmp(path, data->dirname, data->dir_len)) return 0; /* Free previous dirname, and cache path's dirname */ - *dirname = path; - *dir_len = newdir - path + 1; + data->dirname = path; + data->dir_len = newdir - path + 1; - tmp = xstrndup(path, *dir_len); - *dir_found = !lstat(tmp, &st); + tmp = xstrndup(path, data->dir_len); + data->dir_found = !lstat(tmp, &st); free(tmp); return 0; @@ -488,9 +503,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len, static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate) { - const char *last_dirname = NULL; - size_t dir_len = 0; - int dir_found = 1; + struct path_found_data data = PATH_FOUND_DATA_INIT; int path_count = 0; int to_restart = 0; @@ -502,7 +515,7 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist if (ce_skip_worktree(ce)) { path_count++; - if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { + if (path_found(ce->name, &data)) { if (S_ISSPARSEDIR(ce->ce_mode)) { to_restart = 1; break; @@ -516,14 +529,13 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist "sparse_path_count", path_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); + clear_path_found_data(&data); return to_restart; } static void clear_skip_worktree_from_present_files_full(struct index_state *istate) { - const char *last_dirname = NULL; - size_t dir_len = 0; - int dir_found = 1; + struct path_found_data data = PATH_FOUND_DATA_INIT; int path_count = 0; @@ -537,7 +549,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista if (ce_skip_worktree(ce)) { path_count++; - if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) + if (path_found(ce->name, &data)) ce->ce_flags &= ~CE_SKIP_WORKTREE; } } @@ -546,6 +558,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista "full_path_count", path_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", istate->repo); + clear_path_found_data(&data); } void clear_skip_worktree_from_present_files(struct index_state *istate) From patchwork Wed Jun 26 14:29:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13713032 Received: from mail-lj1-f174.google.com (mail-lj1-f174.google.com [209.85.208.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 25299186285 for ; Wed, 26 Jun 2024 14:29:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.174 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412199; cv=none; b=OfKM08IarJFcWV8bDOui09DLILLes/Qdtnd2tJWsltFcX8PTOsQvcPcfWMzG9N6Okpf/Zds/Gk5KkaMnMFK8sTx/TRkjlH0FJyLDin6WIw0E7ncISwipc330u9dd7LY/rXvmQI3BJTo74FUJGTVGtOn4/GRAemHXHyZqsLI7jmI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412199; c=relaxed/simple; bh=mKI90b/YGFpefVFGNK/0fVqAD9pGjE9PJw08JNb6Gps=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=OYc21PmdB4WFT/sI1M932i96rSiAYo8aNr1BmvEiiCfSLJMhUBekY08eUqu1HDTK4EVkvGh0KvH8/qYFNuRSHCT8SidZDVzyduK6wwlgGuN5bOKii5r593cgRjdl77C880YYV0TcZTSm5HBZUuPTb08cec9gmUoKEkSRPQpg5K0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=TozZHU81; arc=none smtp.client-ip=209.85.208.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="TozZHU81" Received: by mail-lj1-f174.google.com with SMTP id 38308e7fff4ca-2ec52fbb50cso45795461fa.2 for ; Wed, 26 Jun 2024 07:29:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719412196; x=1720016996; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=xO1bF8/D+K6VmcsbNZerxp108u1S6w5egvhSyaxchUI=; b=TozZHU815mc3jQLI8olFS4/er/3wW4rrjxvTIWh3EEEFCqBbdv/a3wmwr5q1Q57jil p2psSZCpNzQouGFI4ZZASJo6VtwbpJOytMg9i/WFvOExxnDttUL/Uny4JnQqFXCc66Wc 6MiSp2jdud6yxdSDpapEFb3bbndUOh9+GVcKoYbKjIjb30DNYCNFxsBskBx0Sve07dq8 bZKgcEq21UzIxdDzNuIWLWBmuz93OdkhDI3qNEwKgi3Wkvh+7TviA2KsN7rxo7Jm1/sa c9Uh6cX9vRTmJdBf8LVMYM8y5LzREr6Ly2qjtLmqwCXwzFFBA+5HiyO2EO/tWI9P9Igr Jabw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719412196; x=1720016996; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=xO1bF8/D+K6VmcsbNZerxp108u1S6w5egvhSyaxchUI=; b=hVEDBoC/HRC8UWmkLaC7ssuvIBK+Op7/DIOaQhljBv2ns8+syrkIZ1l/f/7h2K2M5C XXUlZzViAlgXqRuXz6WpV6DaiED4hS5vkV7KdpeEfxSVFgOS39LpnkpwHtMpyQcyfc/W XSawphlUYRAGFTlQlH1k6XLRUveJi7emD4LqQqU0aEultX3ucBB85BI0zBynDAE/CXfd VZxd6WTCS21oN5wU6A6hkF5+45i8269KWEgLJyV33CCIoszuzZ+pN7OIkinvHeLpy/8K iJRq0CSYwr0tbMIx488rR/LEbhQorzRYv2XTIuEa43Kmk7LmNntgtbpFr18t2tVwU+W0 4jwQ== X-Gm-Message-State: AOJu0Ywo7bhO6OZfAmRdJF+Ta+crI/c+5v1P2poBin6D8xIDJHJs3Quo 4dAIKwEwxP4P9wDJvl+4taMLzL85CrmU7lTDJGNhEpCwlHcNNuPY7uwYuw== X-Google-Smtp-Source: AGHT+IELP5ndCbE+OHwGN60yWzOEk6SgWKRJb8hhfRnt7795RZgtA9t9d/y422WqqbMD2usSHtQIEw== X-Received: by 2002:a2e:9d86:0:b0:2ec:5a0d:b2dd with SMTP id 38308e7fff4ca-2ec5b2f0373mr63734311fa.39.1719412195950; Wed, 26 Jun 2024 07:29:55 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-424c8264891sm28402975e9.25.2024.06.26.07.29.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Jun 2024 07:29:55 -0700 (PDT) Message-Id: <0a82e6b4183e007316a8b027086ad3c960901103.1719412192.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Wed, 26 Jun 2024 14:29:49 +0000 Subject: [PATCH v2 3/5] sparse-index: use strbuf in path_found() Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The path_found() method previously reused strings from the cache entries the calling methods were using. This prevents string manipulation in place and causes some odd reallocation before the final lstat() call in the method. Refactor the method to use strbufs and copy the path into the strbuf, but also only the parent directory and not the whole path. This looks like extra copying when assigning the path to the strbuf, but we save an allocation by dropping the 'tmp' string, and we are "reusing" the copy from 'tmp' to put the data in the strbuf. Signed-off-by: Derrick Stolee --- sparse-index.c | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index de6e727f5c1..fec4f393360 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -440,31 +440,30 @@ void ensure_correct_sparsity(struct index_state *istate) } struct path_found_data { - const char *dirname; - size_t dir_len; + struct strbuf dir; int dir_found; }; #define PATH_FOUND_DATA_INIT { \ + .dir = STRBUF_INIT, \ .dir_found = 1 \ } static void clear_path_found_data(struct path_found_data *data) { - return; + strbuf_release(&data->dir); } static int path_found(const char *path, struct path_found_data *data) { struct stat st; char *newdir; - char *tmp; /* * If dirname corresponds to a directory that doesn't exist, and this * path starts with dirname, then path can't exist. */ - if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len)) + if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len)) return 0; /* @@ -486,17 +485,15 @@ static int path_found(const char *path, struct path_found_data *data) * If path starts with directory (which we already lstat'ed and found), * then no need to lstat parent directory again. */ - if (data->dir_found && data->dirname && - memcmp(path, data->dirname, data->dir_len)) + if (data->dir_found && data->dir.buf && + memcmp(path, data->dir.buf, data->dir.len)) return 0; /* Free previous dirname, and cache path's dirname */ - data->dirname = path; - data->dir_len = newdir - path + 1; + strbuf_reset(&data->dir); + strbuf_add(&data->dir, path, newdir - path + 1); - tmp = xstrndup(path, data->dir_len); - data->dir_found = !lstat(tmp, &st); - free(tmp); + data->dir_found = !lstat(data->dir.buf, &st); return 0; } From patchwork Wed Jun 26 14:29:50 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13713033 Received: from mail-lj1-f181.google.com (mail-lj1-f181.google.com [209.85.208.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 734D1186E55 for ; Wed, 26 Jun 2024 14:29:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.181 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412201; cv=none; b=VO2pYWQGWxgT2qX1dJqKM+1P2EfmFJnW+EBPYYbReAzkQr5wKC1HNQtsWvKlPz50SdCiMLzeOoDKqt1boY+az+PxpakNpmUJKAzOmacn1rpyJz2TEuOxvtol7lKic9UjHXNQhdabT58WaaoY1IOqlLVVdAjZMUTtmTkY0OFbS+8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412201; c=relaxed/simple; bh=v7pEIAOQacxrKcSmfYu3MMPjIoqKiKMlpNo3w47ds7I=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=BPI8eeFwl//M4frzofh8ZLDC0ChYN1pnJ4YTtBzVikeptH+4ZxiY75jRMobxOoyKYdLZtAzSTE9LXhnOgJo9Hrvde300yiBNripLDlYNK0oaB4bao9rpxsG1aak1DG8yoniv1yqlKgnGHMzXJRkf/NFV+xiqNuJHlnk5XYH/4hg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=S4Sq93rm; arc=none smtp.client-ip=209.85.208.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="S4Sq93rm" Received: by mail-lj1-f181.google.com with SMTP id 38308e7fff4ca-2eaafda3b5cso69877061fa.3 for ; Wed, 26 Jun 2024 07:29:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719412197; x=1720016997; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=KauGz/GdUXdGho5U7fUDm2FdMS4nlt8MegpX3RmXHxs=; b=S4Sq93rm408vhBVlcesgUpYRqYlU/CRpDvaDbXHLGiQecrnkhr65xUm1r6CBlJTCs7 kcsbODEtjm4h3MPJGiqxzuhf+yvF/c6nlamOiXgtILjp0PwMgE0DifrA7AGeyA91bM/e vRYnz6EfQs5vX+0We/cbMIenGC1kSq+IYRclxg0SGQx3zbol+fqPblBFT1ICSbrQHgih am6k9Ojg87RXqb6LNFdsKkXVCd5NehlSkiMt2IXaIykNh6E/FW0dGfdiqpbgtAgT8qKh 7wjtrLesxGbHsB8XRIFva6nBRovyzB0pr2NVUj+gMHtwpjZjLnC4IQSUrHuD1u3Pcdx6 /Wyg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719412197; x=1720016997; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=KauGz/GdUXdGho5U7fUDm2FdMS4nlt8MegpX3RmXHxs=; b=SYaILLu/fMsgjXu+Mygvp3SOYA6diLrupBcbdgU0sdrOSXQwJ2m0SWf0GVYleNl0AH rhMqNtFFSiJ76Qn+CT9ghj3UmmgutFBtr3mzWIpELerpEGTWk8mIk7c6kLA8zJTgrY0F bWStF90QtWtJkawjC8j52h1HG6RE6mXkte/Wg/vdrVQ17L8OFjMRw2bk+jW3ZQiReBPt wfPjU/QvKm1ewBDzpvkFi6PUiUwvExcVa+itEz0h+Ohc65QH9qRY1klQwMB99u/QODbK HVSCxzgODP5JXbUQ49f06MJT4+d9BRUibUdmc6OBBWT6KJAHlNqQbZVRzEuCjJX6p4Oo ok0w== X-Gm-Message-State: AOJu0Yy6Xj1JVPAWhrme7R97h7Tal+EpZVv6KL9/0et+FZSGfo0yu5KV p5weRKf1Qp+Q50vqvpZBbZhYYG/5gnaZX8DrVbXvAgpDMdD1gj5UaiNO5g== X-Google-Smtp-Source: AGHT+IHHNdLFeyRuFVru0UuusjqQ+WOpVFIJvdU1nGni+FLPjpGqYv7wnvWONvG01rd3iZyC9/i4Uw== X-Received: by 2002:a2e:91c7:0:b0:2ec:5172:dbb8 with SMTP id 38308e7fff4ca-2ec593be975mr60289641fa.7.1719412197100; Wed, 26 Jun 2024 07:29:57 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-366fbbac944sm5999592f8f.116.2024.06.26.07.29.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Jun 2024 07:29:56 -0700 (PDT) Message-Id: <9549f5b8062909253fef1fc9a62dae9114c43481.1719412192.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Wed, 26 Jun 2024 14:29:50 +0000 Subject: [PATCH v2 4/5] sparse-index: count lstat() calls Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree.. methods already report some statistics about how many cache entries are checked against path_found() due to having the skip-worktree bit set. However, due to path_found() performing some caching, this isn't the only information that would be helpful to report. Add a new lstat_count member to the path_found_data struct to count the number of times path_found() calls lstat(). This will be helpful to help explain performance problems in this method as well as to demonstrate future changes to the caching algorithm in a more concrete way than end-to-end timings. Signed-off-by: Derrick Stolee --- sparse-index.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/sparse-index.c b/sparse-index.c index fec4f393360..8577fa726b8 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -442,6 +442,7 @@ void ensure_correct_sparsity(struct index_state *istate) struct path_found_data { struct strbuf dir; int dir_found; + size_t lstat_count; }; #define PATH_FOUND_DATA_INIT { \ @@ -469,6 +470,7 @@ static int path_found(const char *path, struct path_found_data *data) /* * If path itself exists, return 1. */ + data->lstat_count++; if (!lstat(path, &st)) return 1; @@ -493,6 +495,7 @@ static int path_found(const char *path, struct path_found_data *data) strbuf_reset(&data->dir); strbuf_add(&data->dir, path, newdir - path + 1); + data->lstat_count++; data->dir_found = !lstat(data->dir.buf, &st); return 0; @@ -524,6 +527,8 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist trace2_data_intmax("index", istate->repo, "sparse_path_count", path_count); + trace2_data_intmax("index", istate->repo, + "sparse_lstat_count", data.lstat_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse", istate->repo); clear_path_found_data(&data); @@ -553,6 +558,8 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista trace2_data_intmax("index", istate->repo, "full_path_count", path_count); + trace2_data_intmax("index", istate->repo, + "full_lstat_count", data.lstat_count); trace2_region_leave("index", "clear_skip_worktree_from_present_files_full", istate->repo); clear_path_found_data(&data); From patchwork Wed Jun 26 14:29:51 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13713034 Received: from mail-wr1-f41.google.com (mail-wr1-f41.google.com [209.85.221.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B4EDE187334 for ; Wed, 26 Jun 2024 14:30:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.221.41 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412202; cv=none; b=e8MeyUu+fZcZk/3M1xdXjpfGhRqtt0e9DuclKNvCDVrzEv5iq1pBIfNfYdOdUVZTu7F3xhmJ5ofqUqZi8bi3gdSvkbP0BSPol5syVkTK+elMnRUqIK7Ft1zlzVV0mgUpgoNwEcX8MYEPQm3WbBuB52Ov4zSu5rP9U55IT57CJV0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1719412202; c=relaxed/simple; bh=DKdsyken6hSuTZ7OWFxnzzYgmY3pk6kxwlrSYyrHy1w=; h=Message-Id:In-Reply-To:References:From:Date:Subject:MIME-Version: Content-Type:To:Cc; b=D/lbo5tcaw4XwuDAIlJ83xUQ81y6huUBegi7mtbpowXFXhVlJmH8GWVBWq8Pn2mbjoogqdv2ATVwPB7aiBhVPZqQ8+j4KGFjMiYLI0i8kI6tyMR5madTNMPh2ZcZQOR9DM1pSuoNSu6NnSFzMSJxBcd8POWFxKotY80tuEOAfpQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=F1+PSoMq; arc=none smtp.client-ip=209.85.221.41 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="F1+PSoMq" Received: by mail-wr1-f41.google.com with SMTP id ffacd0b85a97d-35f2c9e23d3so434151f8f.0 for ; Wed, 26 Jun 2024 07:30:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1719412199; x=1720016999; darn=vger.kernel.org; h=cc:to:fcc:content-transfer-encoding:mime-version:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=KPk9QmR2Hw4pTGQX5+tllWwY8IL4/dZICpZqmT+9aEQ=; b=F1+PSoMqA9DernTCl6Ys4y3iQS1Dn9V/yhooRghZK1yWxweAckiq0Rs4WTS+s3cuor uCGLpAIZLqcIeodzLE8OH66rfz6G8idQtqkkF7cdsXodzE5DYqWqqFI3Sw0A9CbyqmkU 77hxZZOhAAQxzTwiWuL+JwaKCNYQBLrXISx8lCq+nAlxjwKj/SpjluNjVMnG6g1WuQuY Ki/Iw1q91j5fbyFs60uqpn2MSA5e/OrRLBZrNLW/F+/nEYoNsGCpcT8KsLhUK5Sg/gg7 EBWHODJJN2b8reDx0YEcvU+OjXLMTPjejuo1BiX+MrmJaaX0Jxh+RBK8GHocqhBtTNC2 gh/Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1719412199; x=1720016999; h=cc:to:fcc:content-transfer-encoding:mime-version:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=KPk9QmR2Hw4pTGQX5+tllWwY8IL4/dZICpZqmT+9aEQ=; b=c5PPrMzOH4peuX1VW14tq2qrVDnZxl7LA1xtQ9mvm6iXKQk9vZLRTXhtpO+cJfgwEk jrGFfk4JDLPuzkjaiiQuY5knoWPkKhiy+NrimrxbyEqiC+yXgZkNbiRT5053GGrqA8gb smCdji0+6i1j6lpFjE229BQI+ZPUBrd4ifkLdyXu5hVssxilvx0Q9rCBdA1pLK0POA7I qII9JvBwFMVSMCeDz/q8Afqf2eiZy0rg8L6hoh/7/u4Rl9ZW01l7foVW7QnqfKaHuCY0 2LmQT/5ZBUSnotEUeHx3iC3RUMmhDNk4hoFZNVqxzWWwGoRjUN7sPzKrbAHfG38Mkw/2 m2Fg== X-Gm-Message-State: AOJu0YziWjXCsmsEvzNkfuSZipb+gybpPAjYY/Yco0FWd5QTvsFZDBOr cTO1NUrc4rDjahw42cUob1YIZVEUhHGqvFTv83pczhPXLRdwNu5OAfNDHg== X-Google-Smtp-Source: AGHT+IHHzPVG5WD39QmIKJ7oQTKlpckBsrQLeoFjNei/stgy3YzHykjBQnZu//ILbpCjHXmhhs5thA== X-Received: by 2002:a05:6000:542:b0:362:d0f7:1e2d with SMTP id ffacd0b85a97d-366e32f7094mr9638463f8f.22.1719412198385; Wed, 26 Jun 2024 07:29:58 -0700 (PDT) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-366383f68acsm16078101f8f.2.2024.06.26.07.29.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Jun 2024 07:29:57 -0700 (PDT) Message-Id: <0cb344ac14fa942f781221663e2324fd2225fb5f.1719412192.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Wed, 26 Jun 2024 14:29:51 +0000 Subject: [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee The clear_skip_worktree_from_present_files() method was first introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to allow better interaction with the working directory in the presence of paths outside of the sparse-checkout. The initial implementation would lstat() every single SKIP_WORKTREE path to see if it existed; if it ran across a sparse directory that existed (when a sparse index was in use), then it would expand the index and then check every SKIP_WORKTREE path. Since these lstat() calls were very expensive, this was improved in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14) by caching directories that do not exist so it could avoid lstat()ing any files under such directories. However, there are some inefficiencies in that caching mechanism. The caching mechanism stored only the parent directory as not existing, even if a higher parent directory also does not exist. This means that wasted lstat() calls would occur when the paths passed to path_found() change immediate parent directories but within the same parent directory that does not exist. To create an example repository that demonstrates this problem, it helps to have a directory outside of the sparse-checkout that contains many deep paths. In particular, the first paths (in lexicographic order) underneath the sparse directory should have deep directory structures, maximizing the difference between the old caching algorithm that looks to a single parent and the new caching algorithm that looks to the top-most missing directory. The performance test script p2000-sparse-operations.sh takes the sample repository and copies its HEAD to several copies nested in directories of the form f/f/f where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior and also lead to some interesting cases for the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do not. This is difficult to notice when running performance tests using the Git repository (or a blow-up of the Git repository, as in p2000-sparse-operations.sh) because Git has a very shallow directory structure. This change reorganizes the caching algorithm to focus on storing the highest level leading directory that does not exist; specifically this means that that directory's parent _does_ exist. By doing a little extra work on a path passed to path_found(), we can short-circuit all of the paths passed to path_found() afterwards that match a prefix with that non-existing directory. When in a repository where the first sparse file is likely to have a much deeper path than the first non-existing directory, this can realize significant gains. The details of this algorithm require careful attention, so the new implementation of path_found() has detailed comments, including the use of a new max_common_dir_prefix() method that may be of independent interest. It's worth noting that this is not universally positive, since we are doing extra lstat() calls to establish the exact path to cache. In the blow-up of the Git repository, we can see that the lstat count _increases_ from 28 to 31. However, these numbers were already artificially low. Contributor Elijah Newren created a publicly-available test repository that demonstrates the difference in these caching algorithms in the most extreme way. To test, follow these steps: git clone --sparse https://github.com/newren/gvfs-like-git-bomb cd gvfs-like-git-bomb ./runme.sh # NOTE: check scripts before running! At this point, assuming you do not have index.sparse=true set globally, the index has one million paths with the SKIP_WORKTREE bit and they will all be sent to path_found() in the sparse loop. You can measure this by running 'git status' with GIT_TRACE2_PERF=1: Sparse files in the index: 1,000,000 sparse_lstat_count (before): 200,000 sparse_lstat_count (after): 2 And here are the performance numbers: Benchmark 1: old Time (mean ± σ): 397.5 ms ± 4.1 ms Range (min … max): 391.2 ms … 404.8 ms 10 runs Benchmark 2: new Time (mean ± σ): 252.7 ms ± 3.1 ms Range (min … max): 249.4 ms … 259.5 ms 11 runs Summary 'new' ran 1.57 ± 0.02 times faster than 'old' By modifying this example further, we can demonstrate a more realistic example and include the sparse index expansion. Continue by creating this directory, confusing both caching algorithms somewhat: mkdir -p bomb/d/e/f/a/a Then re-run the 'git status' tests to see these statistics: Sparse files in the index: 1,000,000 sparse_lstat_count (before): 724,010 sparse_lstat_count (after): 106 Benchmark 1: old Time (mean ± σ): 753.0 ms ± 3.5 ms Range (min … max): 749.7 ms … 760.9 ms 10 runs Benchmark 2: new Time (mean ± σ): 201.4 ms ± 3.2 ms Range (min … max): 196.0 ms … 207.9 ms 14 runs Summary 'new' ran 3.74 ± 0.06 times faster than 'old' Note that if this repository had a sparse index enabled, the additional cost of expanding the sparse index affects the total time of these commands by over four seconds, significantly diminishing the benefit of the caching algorithm. Having existing paths outside of the sparse-checkout is a known performance issue for the sparse index and is a known trade-off for the performance benefits given when no such paths exist. Using an internal monorepo with over two million paths at HEAD and a typical sparse-checkout cone such that the sparse index contains ~190,000 entries (including over two thousand sparse trees), I was able to measure these lstat counts when one sparse directory actually exists on disk: Sparse files in expanded index: 1,841,997 full_lstat_count (before): 1,188,161 full_lstat_count (after): 4,404 This resulted in this absolute time change, on a warm disk: Time in full loop (before): 13.481 s Time in full loop (after): 0.081 s (These times were calculated on a Windows machine, where lstat() is slower than a similar Linux machine.) Helped-by: Elijah Newren Signed-off-by: Derrick Stolee --- sparse-index.c | 114 ++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 90 insertions(+), 24 deletions(-) diff --git a/sparse-index.c b/sparse-index.c index 8577fa726b8..7e433250248 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate) } struct path_found_data { + /** + * The path stored in 'dir', if non-empty, corresponds to the most- + * recent path that we checked where: + * + * 1. The path should be a directory, according to the index. + * 2. The path does not exist. + * 3. The parent path _does_ exist. (This may be the root of the + * working directory.) + */ struct strbuf dir; - int dir_found; size_t lstat_count; }; #define PATH_FOUND_DATA_INIT { \ - .dir = STRBUF_INIT, \ - .dir_found = 1 \ + .dir = STRBUF_INIT \ } static void clear_path_found_data(struct path_found_data *data) @@ -455,49 +462,108 @@ static void clear_path_found_data(struct path_found_data *data) strbuf_release(&data->dir); } +/** + * Return the length of the largest common substring that ends in a + * slash ('/') to indicate the largest common parent directory. Returns + * zero if no common directory exists. + */ +static size_t max_common_dir_prefix(const char *path1, const char *path2) +{ + size_t common_prefix = 0; + for (size_t i = 0; path1[i] && path2[i]; i++) { + if (path1[i] != path2[i]) + break; + + /* + * If they agree at a directory separator, then add one + * to make sure it is included in the common prefix string. + */ + if (path1[i] == '/') + common_prefix = i + 1; + } + + return common_prefix; +} + static int path_found(const char *path, struct path_found_data *data) { struct stat st; - char *newdir; + size_t common_prefix; /* - * If dirname corresponds to a directory that doesn't exist, and this - * path starts with dirname, then path can't exist. + * If data->dir is non-empty, then it contains a path that doesn't + * exist, including an ending slash ('/'). If it is a prefix of 'path', + * then we can return 0. */ - if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len)) + if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len)) return 0; /* - * If path itself exists, return 1. + * Otherwise, we must check if the current path exists. If it does, then + * return 1. The cached directory will be skipped until we come across + * a missing path again. */ data->lstat_count++; if (!lstat(path, &st)) return 1; /* - * Otherwise, path does not exist so we'll return 0...but we'll first - * determine some info about its parent directory so we can avoid - * lstat calls for future cache entries. + * At this point, we know that 'path' doesn't exist, and we know that + * the parent directory of 'data->dir' does exist. Let's set 'data->dir' + * to be the top-most non-existing directory of 'path'. If the first + * parent of 'path' exists, then we will act as though 'path' + * corresponds to a directory (by adding a slash). */ - newdir = strrchr(path, '/'); - if (!newdir) - return 0; /* Didn't find a parent dir; just return 0 now. */ + common_prefix = max_common_dir_prefix(path, data->dir.buf); /* - * If path starts with directory (which we already lstat'ed and found), - * then no need to lstat parent directory again. + * At this point, 'path' and 'data->dir' have a common existing parent + * directory given by path[0..common_prefix] (which could have length 0). + * We "grow" the data->dir buffer by checking for existing directories + * along 'path'. */ - if (data->dir_found && data->dir.buf && - memcmp(path, data->dir.buf, data->dir.len)) - return 0; - /* Free previous dirname, and cache path's dirname */ - strbuf_reset(&data->dir); - strbuf_add(&data->dir, path, newdir - path + 1); + strbuf_setlen(&data->dir, common_prefix); + while (1) { + /* Find the next directory in 'path'. */ + const char *rest = path + data->dir.len; + const char *next_slash = strchr(rest, '/'); - data->lstat_count++; - data->dir_found = !lstat(data->dir.buf, &st); + /* + * If there are no more slashes, then 'path' doesn't contain a + * non-existent _parent_ directory. Set 'data->dir' to be equal + * to 'path' plus an additional slash, so it can be used for + * caching in the future. The filename of 'path' is considered + * a non-existent directory. + * + * Note: if "{path}/" exists as a directory, then it will never + * appear as a prefix of other callers to this method, assuming + * the context from the clear_skip_worktree... methods. If this + * method is reused, then this must be reconsidered. + */ + if (!next_slash) { + strbuf_addstr(&data->dir, rest); + strbuf_addch(&data->dir, '/'); + break; + } + /* + * Now that we have a slash, let's grow 'data->dir' to include + * this slash, then test if we should stop. + */ + strbuf_add(&data->dir, rest, next_slash - rest + 1); + + /* If the parent dir doesn't exist, then stop here. */ + data->lstat_count++; + if (lstat(data->dir.buf, &st)) + return 0; + } + + /* + * At this point, 'data->dir' is equal to 'path' plus a slash character, + * and the parent directory of 'path' definitely exists. Moreover, we + * know that 'path' doesn't exist, or we would have returned 1 earlier. + */ return 0; }