From patchwork Mon Feb 3 17:11:03 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Derrick Stolee
X-Patchwork-Id: 13957881
Message-Id: <1612ad924556974e41587b441ccf886d74963048.1738602667.git.gitgitgadget@gmail.com>
Date: Mon, 03 Feb 2025 17:11:03 +0000
Subject: [PATCH v3 1/5] backfill: add builtin boilerplate
X-Mailing-List: git@vger.kernel.org
To: git@vger.kernel.org
Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im,
    me@ttaylorr.com, johncai86@gmail.com, newren@gmail.com,
    christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com,
    jonathantanmy@google.com, karthik.188@gmail.com, Jean-Noël AVILA,
    Derrick Stolee
From: Derrick Stolee

In anticipation of implementing 'git backfill', populate the necessary
files with the boilerplate of a new builtin. Mark the builtin as
experimental at this time, allowing breaking changes in the near future,
if necessary.

Signed-off-by: Derrick Stolee
---
 .gitignore                     |  1 +
 Documentation/git-backfill.txt | 25 +++++++++++++++++++++++++
 Documentation/meson.build      |  1 +
 Makefile                       |  1 +
 builtin.h                      |  1 +
 builtin/backfill.c             | 28 ++++++++++++++++++++++++++++
 command-list.txt               |  1 +
 git.c                          |  1 +
 meson.build                    |  1 +
 9 files changed, 60 insertions(+)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c

diff --git a/.gitignore b/.gitignore
index e82aa19df03..95cd94c5044 100644
--- a/.gitignore
+++ b/.gitignore
@@ -19,6 +19,7 @@
 /git-apply
 /git-archimport
 /git-archive
+/git-backfill
 /git-bisect
 /git-blame
 /git-branch
diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
new file mode 100644
index 00000000000..ab384dad6e4
--- /dev/null
+++ b/Documentation/git-backfill.txt
@@ -0,0 +1,25 @@
+git-backfill(1)
+===============
+
+NAME
+----
+git-backfill - Download missing objects in a partial clone
+
+
+SYNOPSIS
+--------
+[synopsis]
+git backfill []
+
+DESCRIPTION
+-----------
+
+THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
+
+SEE ALSO
+--------
+linkgit:git-clone[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 2a26fa8a5fe..5e9e3e19c5c 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -6,6 +6,7 @@ manpages = {
   'git-apply.txt' : 1,
   'git-archimport.txt' : 1,
   'git-archive.txt' : 1,
+  'git-backfill.txt' : 1,
   'git-bisect.txt' : 1,
   'git-blame.txt' : 1,
   'git-branch.txt' : 1,
diff --git a/Makefile b/Makefile
index 2f6e2d52955..efa199d390c 100644
--- a/Makefile
+++ b/Makefile
@@ -1203,6 +1203,7 @@ BUILTIN_OBJS += builtin/am.o
 BUILTIN_OBJS += builtin/annotate.o
 BUILTIN_OBJS += builtin/apply.o
 BUILTIN_OBJS += builtin/archive.o
+BUILTIN_OBJS += builtin/backfill.o
 BUILTIN_OBJS += builtin/bisect.o
 BUILTIN_OBJS += builtin/blame.o
 BUILTIN_OBJS += builtin/branch.o
diff --git a/builtin.h b/builtin.h
index f7b166b3348..89928ccf92f 100644
--- a/builtin.h
+++ b/builtin.h
@@ -120,6 +120,7 @@ int cmd_am(int argc, const char **argv, const char *prefix, struct repository *r
 int cmd_annotate(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_apply(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_archive(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/backfill.c b/builtin/backfill.c
new file mode 100644
index 00000000000..58d0866c0fc
--- /dev/null
+++ b/builtin/backfill.c
@@ -0,0 +1,28 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+#include "repository.h"
+#include "object.h"
+
+static const char * const builtin_backfill_usage[] = {
+	N_("git backfill []"),
+	NULL
+};
+
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
+{
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	show_usage_if_asked(argc, argv, builtin_backfill_usage[0]);
+
+	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
+			     0);
+
+	repo_config(repo, git_default_config, NULL);
+
+	die(_("not implemented"));
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e0bb87b3b5c..c537114b468 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -60,6 +60,7 @@ git-annotate                            ancillaryinterrogators
 git-apply                               plumbingmanipulators            complete
 git-archimport                          foreignscminterface
 git-archive                             mainporcelain
+git-backfill                            mainporcelain           history
 git-bisect                              mainporcelain           info
 git-blame                               ancillaryinterrogators          complete
 git-branch                              mainporcelain           history
diff --git a/git.c b/git.c
index a94dab37702..a9b45fcef79 100644
--- a/git.c
+++ b/git.c
@@ -506,6 +506,7 @@ static struct cmd_struct commands[] = {
 	{ "annotate", cmd_annotate, RUN_SETUP },
 	{ "apply", cmd_apply, RUN_SETUP_GENTLY },
 	{ "archive", cmd_archive, RUN_SETUP_GENTLY },
+	{ "backfill", cmd_backfill, RUN_SETUP },
 	{ "bisect", cmd_bisect, RUN_SETUP },
 	{ "blame", cmd_blame, RUN_SETUP },
 	{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
diff --git a/meson.build b/meson.build
index 548eac62b28..527c015acfa 100644
--- a/meson.build
+++ b/meson.build
@@ -487,6 +487,7 @@ builtin_sources = [
   'builtin/annotate.c',
   'builtin/apply.c',
   'builtin/archive.c',
+  'builtin/backfill.c',
   'builtin/bisect.c',
   'builtin/blame.c',
   'builtin/branch.c',

From patchwork Mon Feb 3 17:11:04 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Derrick Stolee
X-Patchwork-Id: 13957882
Date: Mon, 03 Feb 2025 17:11:04 +0000
Subject: [PATCH v3 2/5] backfill: basic functionality and tests
X-Mailing-List: git@vger.kernel.org
To: git@vger.kernel.org
Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im,
    me@ttaylorr.com, johncai86@gmail.com, newren@gmail.com,
    christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com,
    jonathantanmy@google.com, karthik.188@gmail.com, Jean-Noël AVILA,
    Derrick Stolee
From: Derrick Stolee

The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

The path-walk API provides lists of objects in batches according to a
common path, but that list could be very small. We want to balance the
number of requests to the server with the ability to have the process
interrupted with minimal repeated work to catch up in the next run.
Based on some experiments (detailed in the next change) a minimum batch
size of 50,000 is selected for the default.

This batch size is a _minimum_. As the path-walk API emits lists of blob
IDs, they are collected into a list of objects for a request to the
server. When that list is at least the minimum batch size, then the
request is sent to the server for the new objects. However, the list of
blob IDs from the path-walk API could be much longer than the batch size.
At this moment, it is unclear if there is a benefit to split the list when
there are too many objects at the same path.

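The minimum-batch rule described above can be illustrated with a small
stand-alone sketch. It is not the code from this patch: the names
`batch_sim`, `sim_flush`, `sim_add_path` and `simulate_backfill` are
hypothetical, and the sketch only counts blob IDs per path instead of
storing them or contacting a server. It assumes that an empty batch sends
no request, which is a choice of the sketch rather than something the
patch states.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the patch's backfill_context: it only counts
 * blob IDs instead of storing them in an oid_array.
 */
struct batch_sim {
	size_t pending;        /* IDs collected but not yet requested */
	size_t min_batch_size; /* flush once pending reaches this value */
	size_t requests;       /* simulated requests sent so far */
	size_t largest;        /* size of the largest single request */
};

/* Simulate one request; an empty batch sends nothing (sketch's choice). */
static void sim_flush(struct batch_sim *sim)
{
	if (!sim->pending)
		return;
	sim->requests++;
	if (sim->pending > sim->largest)
		sim->largest = sim->pending;
	sim->pending = 0;
}

/* Called once per path with the number of missing blobs at that path. */
static void sim_add_path(struct batch_sim *sim, size_t missing)
{
	sim->pending += missing;
	/* A path's list is never split, so a request may exceed the minimum. */
	if (sim->pending >= sim->min_batch_size)
		sim_flush(sim);
}

/* Run the whole walk, then send whatever did not fill a batch. */
struct batch_sim simulate_backfill(const size_t *per_path, size_t nr,
				   size_t min_batch_size)
{
	struct batch_sim sim = { 0, min_batch_size, 0, 0 };

	assert(per_path || !nr);
	for (size_t i = 0; i < nr; i++)
		sim_add_path(&sim, per_path[i]);
	sim_flush(&sim);
	return sim;
}
```

For example, per-path counts of 30,000, 30,000, 120,000 and 10 with a
minimum of 50,000 give three simulated requests of 60,000, 120,000 and 10
blob IDs: the second request exceeds the minimum because a single path's
list is kept whole, and the last one is the remainder sent at the end.
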
Signed-off-by: Derrick Stolee
---
 Documentation/git-backfill.txt            |  31 +++++++
 Documentation/technical/api-path-walk.txt |   3 +-
 builtin/backfill.c                        | 102 +++++++++++++++++++++-
 t/meson.build                             |   1 +
 t/t5620-backfill.sh                       |  94 ++++++++++++++++++++
 5 files changed, 227 insertions(+), 4 deletions(-)
 create mode 100755 t/t5620-backfill.sh

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index ab384dad6e4..56cbb9ffd82 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -14,6 +14,37 @@ git backfill []
 DESCRIPTION
 -----------
 
+Blobless partial clones are created using `git clone --filter=blob:none`
+and then configure the local repository such that the Git client avoids
+downloading blob objects unless they are required for a local operation.
+This initially means that the clone and later fetches download reachable
+commits and trees but no blobs. Later operations that change the `HEAD`
+pointer, such as `git checkout` or `git merge`, may need to download
+missing blobs in order to complete their operation.
+
+In the worst cases, commands that compute blob diffs, such as `git blame`,
+become very slow as they download the missing blobs in single-blob
+requests to satisfy the missing object as the Git command needs it. This
+leads to multiple download requests and no ability for the Git server to
+provide delta compression across those objects.
+
+The `git backfill` command provides a way for the user to request that
+Git downloads the missing blobs (with optional filters) such that the
+missing blobs representing historical versions of files can be downloaded
+in batches. The `backfill` command attempts to optimize the request by
+grouping blobs that appear at the same path, hopefully leading to good
+delta compression in the packfile sent by the server.
+
+In this way, `git backfill` provides a mechanism to break a large clone
+into smaller chunks. Starting with a blobless partial clone with `git
+clone --filter=blob:none` and then running `git backfill` in the local
+repository provides a way to download all reachable objects in several
+smaller network calls than downloading the entire repository at clone
+time.
+
+By default, `git backfill` downloads all blobs reachable from the `HEAD`
+commit. This set can be restricted or expanded using various options.
+
 THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
 
 SEE ALSO
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 7075d0d5ab5..1fba0ce04cb 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -60,4 +60,5 @@ Examples
 --------
 
 See example usages in:
-	`t/helper/test-path-walk.c`
+	`t/helper/test-path-walk.c`,
+	`builtin/backfill.c`
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 58d0866c0fc..0eca175a7fe 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,16 +1,112 @@
 #include "builtin.h"
+#include "git-compat-util.h"
 #include "config.h"
 #include "parse-options.h"
 #include "repository.h"
+#include "commit.h"
+#include "hex.h"
+#include "tree.h"
+#include "tree-walk.h"
 #include "object.h"
+#include "object-store-ll.h"
+#include "oid-array.h"
+#include "oidset.h"
+#include "promisor-remote.h"
+#include "strmap.h"
+#include "string-list.h"
+#include "revision.h"
+#include "trace2.h"
+#include "progress.h"
+#include "packfile.h"
+#include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
 	N_("git backfill []"),
 	NULL
 };
 
+struct backfill_context {
+	struct repository *repo;
+	struct oid_array current_batch;
+	size_t min_batch_size;
+};
+
+static void backfill_context_clear(struct backfill_context *ctx)
+{
+	oid_array_clear(&ctx->current_batch);
+}
+
+static void download_batch(struct backfill_context *ctx)
+{
+	promisor_remote_get_direct(ctx->repo,
+				   ctx->current_batch.oid,
+				   ctx->current_batch.nr);
+	oid_array_clear(&ctx->current_batch);
+
+	/*
+	 * We likely have a new packfile. Add it to the packed list to
+	 * avoid possible duplicate downloads of the same objects.
+	 */
+	reprepare_packed_git(ctx->repo);
+}
+
+static int fill_missing_blobs(const char *path UNUSED,
+			      struct oid_array *list,
+			      enum object_type type,
+			      void *data)
+{
+	struct backfill_context *ctx = data;
+
+	if (type != OBJ_BLOB)
+		return 0;
+
+	for (size_t i = 0; i < list->nr; i++) {
+		if (!has_object(ctx->repo, &list->oid[i],
+				OBJECT_INFO_FOR_PREFETCH))
+			oid_array_append(&ctx->current_batch, &list->oid[i]);
+	}
+
+	if (ctx->current_batch.nr >= ctx->min_batch_size)
+		download_batch(ctx);
+
+	return 0;
+}
+
+static int do_backfill(struct backfill_context *ctx)
+{
+	struct rev_info revs;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	int ret;
+
+	repo_init_revisions(ctx->repo, &revs, "");
+	handle_revision_arg("HEAD", &revs, 0, 0);
+
+	info.blobs = 1;
+	info.tags = info.commits = info.trees = 0;
+
+	info.revs = &revs;
+	info.path_fn = fill_missing_blobs;
+	info.path_fn_data = ctx;
+
+	ret = walk_objects_by_path(&info);
+
+	/* Download the objects that did not fill a batch. */
+	if (!ret)
+		download_batch(ctx);
+
+	path_walk_info_clear(&info);
+	release_revisions(&revs);
+	return ret;
+}
+
 int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
 {
+	int result;
+	struct backfill_context ctx = {
+		.repo = repo,
+		.current_batch = OID_ARRAY_INIT,
+		.min_batch_size = 50000,
+	};
 	struct option options[] = {
 		OPT_END(),
 	};
@@ -22,7 +118,7 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
-	die(_("not implemented"));
-
-	return 0;
+	result = do_backfill(&ctx);
+	backfill_context_clear(&ctx);
+	return result;
 }
diff --git a/t/meson.build b/t/meson.build
index 35f25ca4a1d..af53e8ee583 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -721,6 +721,7 @@ integration_tests = [
   't5617-clone-submodules-remote.sh',
   't5618-alternate-refs.sh',
   't5619-clone-local-ambiguous-transport.sh',
+  't5620-backfill.sh',
   't5700-protocol-v1.sh',
   't5701-git-serve.sh',
   't5702-protocol-v2.sh',
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
new file mode 100755
index 00000000000..64326362d80
--- /dev/null
+++ b/t/t5620-backfill.sh
@@ -0,0 +1,94 @@
+#!/bin/sh
+
+test_description='git backfill on partial clones'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# We create objects in the 'src' repo.
+test_expect_success 'setup repo for object creation' '
+	echo "{print \$1}" >print_1.awk &&
+	echo "{print \$2}" >print_2.awk &&
+
+	git init src &&
+
+	mkdir -p src/a/b/c &&
+	mkdir -p src/d/e &&
+
+	for i in 1 2
+	do
+		for n in 1 2 3 4
+		do
+			echo "Version $i of file $n" > src/file.$n.txt &&
+			echo "Version $i of file a/$n" > src/a/file.$n.txt &&
+			echo "Version $i of file a/b/$n" > src/a/b/file.$n.txt &&
+			echo "Version $i of file a/b/c/$n" > src/a/b/c/file.$n.txt &&
+			echo "Version $i of file d/$n" > src/d/file.$n.txt &&
+			echo "Version $i of file d/e/$n" > src/d/e/file.$n.txt &&
+			git -C src add . &&
+			git -C src commit -m "Iteration $n" || return 1
+		done
+	done
+'
+
+# Clone 'src' into 'srv.bare' so we have a bare repo to be our origin
+# server for the partial clone.
+test_expect_success 'setup bare clone for server' '
+	git clone --bare "file://$(pwd)/src" srv.bare &&
+	git -C srv.bare config --local uploadpack.allowfilter 1 &&
+	git -C srv.bare config --local uploadpack.allowanysha1inwant 1
+'
+
+# do basic partial clone from "srv.bare"
+test_expect_success 'do partial clone 1, backfill gets all objects' '
+	git clone --no-checkout --filter=blob:none \
+		--single-branch --branch=main \
+		"file://$(pwd)/srv.bare" backfill1 &&
+
+	# Backfill with no options gets everything reachable from HEAD.
+	GIT_TRACE2_EVENT="$(pwd)/backfill-file-trace" git \
+		-C backfill1 backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 revs2 &&
+	test_line_count = 0 revs2
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'create a partial clone over HTTP' '
+	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+	rm -rf "$SERVER" repo &&
+	git clone --bare "file://$(pwd)/src" "$SERVER" &&
+	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
+	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
+
+	git clone --no-checkout --filter=blob:none \
+		"$HTTPD_URL/smart/server" backfill-http
+'
+
+test_expect_success 'backfilling over HTTP succeeds' '
+	GIT_TRACE2_EVENT="$(pwd)/backfill-http-trace" git \
+		-C backfill-http backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 rev-list-out &&
+	awk "{print \$1;}" oids &&
+	GIT_TRACE2_EVENT="$(pwd)/walk-trace" git -C backfill-http \
+		cat-file --batch-check batch-out &&
+	! grep missing batch-out
+'
+
+# DO NOT add non-httpd-specific tests here, because the last part of this
+# test script is only executed when httpd is available and enabled.
+
+test_done

From patchwork Mon Feb 3 17:11:05 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Derrick Stolee
X-Patchwork-Id: 13957883
Date: Mon, 03 Feb 2025 17:11:05 +0000
Subject: [PATCH v3 3/5] backfill: add --min-batch-size= option
X-Mailing-List: git@vger.kernel.org
To: git@vger.kernel.org
Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im,
    me@ttaylorr.com, johncai86@gmail.com, newren@gmail.com,
    christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com,
    jonathantanmy@google.com, karthik.188@gmail.com, Jean-Noël AVILA,
    Derrick Stolee
From: Derrick Stolee

Users may want to specify a minimum batch size for their needs. This is
only a minimum: the path-walk API provides a list of OIDs that correspond
to the same path, and thus it is optimal to allow delta compression across
those objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future. For now, we let the path-walk API batches determine the
boundaries.

To get a feeling for the value of specifying the --min-batch-size
parameter, I tested a number of open source repositories available on
GitHub. The procedure was generally:

 1. git clone --filter=blob:none
 2. git backfill

Checking the number of packfiles and the size of the .git/objects/pack
directory helps to identify the effects of different batch sizes.

For the Git repository, we get these results: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|-------| | (Initial clone) | 2 | 119 MB | | | 25K | 8 | 290 MB | 24s | | 50K | 5 | 290 MB | 24s | | 100K | 4 | 290 MB | 29s | Other than the packfile counts decreasing as we need fewer batches, the size and time required is not changing much for this small example. For the nodejs/node repository, we see these results: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|--------| | (Initial clone) | 2 | 330 MB | | | 25K | 19 | 1,222 MB | 1m 22s | | 50K | 11 | 1,221 MB | 1m 24s | | 100K | 7 | 1,223 MB | 1m 40s | | 250K | 4 | 1,224 MB | 2m 23s | | 500K | 3 | 1,216 MB | 4m 38s | Here, we don't have much difference in the size of the repo, though the 500K batch size results in a few MB gained. That comes at a cost of a much longer time. This extra time is due to server-side delta compression happening as the on-disk deltas don't appear to be reusable all the time. But for smaller batch sizes, the server is able to find reasonable deltas partly because we are asking for objects that appear in the same region of the directory tree and include all versions of a file at a specific path. To contrast this example, I tested the microsoft/fluentui repo, which has been known to have inefficient packing due to name hash collisions. 
These results are found before GitHub had the opportunity to repack the
server with more advanced name hash versions:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

Here, a larger variety of batch sizes were chosen because of the great
variation in results. By asking the server to download small batches
corresponding to fewer paths at a time, the server is able to provide
better compression for these batches than it would for a regular clone.
A typical full clone for this repository would require 738 MB.

This example justifies the choice to batch requests by path name,
leading to improved communication with a server that is not optimally
packed.

Finally, the same experiment for the Linux repository had these results:

| Batch Size      | Pack Count | Pack Size | Time    |
|-----------------|------------|-----------|---------|
| (Initial clone) | 2          | 2,153 MB  |         |
| 25K             | 63         | 6,380 MB  | 14m 08s |
| 50K             | 58         | 6,126 MB  | 15m 11s |
| 100K            | 30         | 6,135 MB  | 18m 11s |
| 250K            | 14         | 6,146 MB  | 18m 22s |
| 500K            | 8          | 6,143 MB  | 33m 29s |

Even in this example, where the default name hash algorithm leads to
decent compression of the Linux kernel repository, there is value for
selecting a smaller batch size, to a limit. The 25K batch size has the
fastest time, but uses 250 MB more than the 50K batch size. The 500K
batch size took much more time due to server compression time and thus
we should avoid large batch sizes like this.

Based on these experiments, a batch size of 50,000 was chosen as the
default value.
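
To make the "minimum, not maximum" behavior concrete, here is a minimal
C sketch of the batching rule described in this message. It is a
simplified model under stated assumptions, not the code in
builtin/backfill.c: count_requests() and its parameters are hypothetical
names, and each array entry stands for the number of missing blobs the
path-walk reports at one path. A request is flushed only at a path
boundary, once the pending count reaches the minimum, so a batch may
exceed the minimum by the size of the last path's list.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the batching rule (not Git's implementation).
 * path_counts[i] is the number of missing blobs found at the i-th path.
 * Returns the number of requests that would be sent. If 'largest' is
 * non-NULL, the size of the biggest request is stored there.
 */
size_t count_requests(const size_t *path_counts, size_t nr_paths,
		      size_t min_batch_size, size_t *largest)
{
	size_t pending = 0, requests = 0, max_seen = 0;

	assert(min_batch_size > 0);

	for (size_t i = 0; i < nr_paths; i++) {
		/* All blobs at one path join the batch together. */
		pending += path_counts[i];

		/* Flush only at a path boundary, once the minimum is met. */
		if (pending >= min_batch_size) {
			if (pending > max_seen)
				max_seen = pending;
			requests++;
			pending = 0;
		}
	}

	/* Whatever is left over goes out as a final, possibly small, batch. */
	if (pending) {
		if (pending > max_seen)
			max_seen = pending;
		requests++;
	}

	if (largest)
		*largest = max_seen;
	return requests;
}
```

With per-path counts of 30, 15, 10, 40 and 5 and a minimum of 20, this
model sends four requests (30, 25, 40 and 5 objects), and the largest
request holds 40 objects, twice the requested minimum.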

Signed-off-by: Derrick Stolee
---
 Documentation/git-backfill.txt | 12 +++++++++++-
 builtin/backfill.c             |  4 +++-
 t/t5620-backfill.sh            | 18 ++++++++++++++++++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 56cbb9ffd82..136a1f1d294 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [synopsis]
-git backfill []
+git backfill [--min-batch-size=]
 
 DESCRIPTION
 -----------
@@ -47,6 +47,16 @@ commit. This set can be restricted or expanded using various options.
 
 THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
 
+
+OPTIONS
+-------
+
+`--min-batch-size=`::
+	Specify a minimum size for a batch of missing objects to request
+	from the server. This size may be exceeded by the last set of
+	blobs seen at a given path. The default minimum batch size is
+	50,000.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/builtin/backfill.c b/builtin/backfill.c index 0eca175a7fe..cfebee6e17b 100644 --- a/builtin/backfill.c +++ b/builtin/backfill.c @@ -21,7 +21,7 @@ #include "path-walk.h" static const char * const builtin_backfill_usage[] = { - N_("git backfill []"), + N_("git backfill [--min-batch-size=]"), NULL }; @@ -108,6 +108,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit .min_batch_size = 50000, }; struct option options[] = { + OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size, + N_("Minimum number of objects to request at a time")), OPT_END(), }; diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh index 64326362d80..36107a51c54 100755 --- a/t/t5620-backfill.sh +++ b/t/t5620-backfill.sh @@ -59,6 +59,24 @@ test_expect_success 'do partial clone 1, backfill gets all objects' ' test_line_count = 0 revs2 ' +test_expect_success 'do partial clone 2, backfill min batch size' ' + git clone --no-checkout --filter=blob:none \ + --single-branch --branch=main \ + "file://$(pwd)/srv.bare" backfill2 && + + GIT_TRACE2_EVENT="$(pwd)/batch-trace" git \ + -C backfill2 backfill --min-batch-size=20 && + + # Batches were used + test_trace2_data promisor fetch_count 20 matches && + test_line_count = 2 matches && + test_trace2_data promisor fetch_count 8 revs2 && + test_line_count = 0 revs2 +' + . 
"$TEST_DIRECTORY"/lib-httpd.sh start_httpd From patchwork Mon Feb 3 17:11:06 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13957885 Received: from mail-ed1-f50.google.com (mail-ed1-f50.google.com [209.85.208.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 164F520E018 for ; Mon, 3 Feb 2025 17:11:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.50 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738602680; cv=none; b=MLM+u8peBk+uuoWxATpKzmIcpKGSQDMDqYnM2t5O+jUIavI7ZVQetAKuaWPyxhmqo0ju56T1KZRxpdC5Xg6FQIAUw3mhHLyqxcJ9CcthKnRI4u/iAVfsWJ7BOYmrzd1VrYloEl3S4lEI8w+jYljCFL2XY3qF909i6jPSXhvC18o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738602680; c=relaxed/simple; bh=jr+cmOuPPBphYVy1Hh/TfoLUSv+n71zJBPxzQjzt3DM=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=RJoSOd1Ex4pJkECl30MPEwyok1qFj2n/t5NFtAMokdj4BCIOMmObzWl/TiF4BbfxcW7PEDW+SF+Cv9EvMNw8tnp2gZIZWHST7XS3fNjP+NLzgn2uS6QO7qjget5dHVA38r5jBp2Z2ZSAfesAKVPLsMQPxSF33DOogQbjpP+5U4M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=l9jDuJBM; arc=none smtp.client-ip=209.85.208.50 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="l9jDuJBM" Received: by mail-ed1-f50.google.com with SMTP id 4fb4d7f45d1cf-5d41848901bso9208944a12.0 for ; Mon, 03 
Feb 2025 09:11:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1738602674; x=1739207474; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=BMpLqSc1KKHNUlOrC1Z7vui7p6dsXPtLo1Spe/UF+Ik=; b=l9jDuJBMFbSBez410I8hazcLDEz0G4R0h2UsrZepYMbtC3lkLJAYe+eH0QgQL2Hlxu V84knRl2ZdKget9AR9qhCGfG0cYde5nJAJXP0Ps0OKhpbVYkIF+dP0CGNORmlBB2Ye1P Z79oJlIPunxoltGCOZfDXAdUweVtmXbxsWXHNZrtBHiHE4UWmMyqKL9PNizoDmc9G9RO tCF6U+Y2rVygDr0PgS7U2PhFtJwUzwnS9J2U7vQzfI4nsRj1I1uegSExBBImFs3rzjj7 bTLVMMzcXvxZckQyGol7fe3fJHTZchdP3ryR6RAX0qp0PtK2DzRv8/NTwsGh3UD38AjJ lmaA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738602674; x=1739207474; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=BMpLqSc1KKHNUlOrC1Z7vui7p6dsXPtLo1Spe/UF+Ik=; b=cPJcekgZ7UPxbnZt8pyVN+KGeEQOys59L47Vzp8xq3N6x0caZihWK31YbqkTB0vHM4 GHXtLNnUqkxvGbn2TzB+og8VTj/YDfp/Roott3Ok9uRmUY2poshftld1YNuSQdBHAmnv XPxWBtpwBMkqgdYm88laLuGxKunrvtvllPsH/TDOtQN0kMqfMLbYN07IWBWrOHoValtU IAWcdoMDdvjyTFSEaucMTdycc1GODVszjC5rVSg8aLYqAxPbrj54w/Mwy/whRC7FxK95 133hraPeIjHoT93OxRH4RmCWaXgSqUMt6pd4bsZcDyLkQCVl8E/ukB7fb8ev4aLCXtcY ZxHg== X-Gm-Message-State: AOJu0YzLjXYp1r70TreBQx3lOA+jAUJc1foVx5NRVfgxtxKVNVnC7WXe h+7w3/oc3+HkRIhxOcmUHdUGZXJ9g0NizNkIQmlMnXgeWnMM9Nk+DOQ1uQ== X-Gm-Gg: ASbGncurKlCQR1GKY3/ViMgE1EeeYseBCWv8yVoIZtjlobImPfYG/PIMFOpOdTBCBlm tP5pz+I46FeqAhltyvUtRHixYJ3CWDtburVliLk2L1Yx1pvYy5PsA308ft6D558HvqobQXPSVts 1EDIl0JP3s9UzEYCIvjKaQ5C9e5f3OZ/HGYm2yixN2WnorwpKQYN1M5na2s9+HS43U6hAHh9fzd YDTYZxq43LpyXSD2gAXvPTps8TkToI1rqOEmI7HlGGE4t6NytD8JJC+wVhaz3aQ3BMgk0a/TzUf Z25Lxo+3SsmF6+YV X-Google-Smtp-Source: AGHT+IEevI8dgMOjeQ6KWIY4hAfCYklZSK6sxNC335nuWAhL9cqKb6ARNC33TP7HotvR8vXpKn3o/A== X-Received: by 
2002:a05:6402:4024:b0:5dc:8fc1:b3b5 with SMTP id 4fb4d7f45d1cf-5dcc1592f25mr47150a12.15.1738602673565; Mon, 03 Feb 2025 09:11:13 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id 4fb4d7f45d1cf-5dc724045e5sm8022824a12.37.2025.02.03.09.11.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 03 Feb 2025 09:11:12 -0800 (PST) Message-Id: In-Reply-To: References: Date: Mon, 03 Feb 2025 17:11:06 +0000 Subject: [PATCH v3 4/5] backfill: add --sparse option Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im, me@ttaylorr.com, johncai86@gmail.com, newren@gmail.com, christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com, jonathantanmy@google.com, karthik.188@gmail.com, =?utf-8?q?Jean-No=C3=ABl?= AVILA , Derrick Stolee , Derrick Stolee From: Derrick Stolee From: Derrick Stolee One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensive as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. 
This avoids the server-side cost of walking trees while also achieving
a similar goal. It also downloads in batches based on similar path
names, presenting a resumable download if things are interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that
may point to a 'struct pattern_list'. This could be more general than
the sparse-checkout definition at HEAD, but 'git backfill --sparse' is
currently the only consumer.

Be sure to test this in both cone mode and non-cone mode. Cone mode has
the benefit that the path-walk can skip certain paths once they would
expand beyond the sparse-checkout. Non-cone mode can describe the
included files using both positive and negative patterns, which changes
the possible return values of path_matches_pattern_list(). Test both
kinds of matches for increased coverage.

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are:

 1. git clone --filter=blob:none --sparse
 2. git sparse-checkout set ...
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.
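
The cone and non-cone handling described in this message can be
summarized in a small C sketch. It assumes the match result behaves like
the value returned by path_matches_pattern_list(); the enum and function
names below are hypothetical stand-ins chosen so that the example
compiles on its own, and this is not the code added to path-walk.c.

```c
#include <stdbool.h>

/* Stand-ins for the pattern match results (names are assumptions). */
enum sketch_match {
	SKETCH_UNDECIDED = -1,
	SKETCH_NOT_MATCHED = 0,
	SKETCH_MATCHED = 1,
	SKETCH_MATCHED_RECURSIVE = 2,
};

/* Stand-ins for the two object types seen while scanning a tree. */
enum sketch_type {
	SKETCH_TREE,
	SKETCH_BLOB,
};

/*
 * Decide whether a tree entry should be dropped from the walk.
 *
 * Cone mode: any entry that does not match is dropped, which prunes
 * whole directories outside the cone.
 *
 * Non-cone mode: trees are always explored, because a pattern may match
 * a file at any depth; only blobs without a definite match are dropped.
 */
bool skip_entry(bool cone_mode, enum sketch_type type,
		enum sketch_match match)
{
	if (cone_mode)
		return match == SKETCH_NOT_MATCHED;
	return type == SKETCH_BLOB && match != SKETCH_MATCHED;
}
```

Under this model a non-matching tree is skipped in cone mode but still
walked in non-cone mode, which is why cone mode can avoid loading trees
that lie outside the sparse-checkout.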

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the file within the sparse-checkout
scope. There will not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee
---
 Documentation/git-backfill.txt            |  6 +-
 Documentation/technical/api-path-walk.txt |  8 +++
 builtin/backfill.c                        | 15 +++-
 dir.c                                     | 10 +--
 dir.h                                     |  3 +
 path-walk.c                               | 28 ++++++--
 path-walk.h                               | 11 +++
 t/helper/test-path-walk.c                 | 22 +++++-
 t/t5620-backfill.sh                       | 88 +++++++++++++++++++++++
 t/t6601-path-walk.sh                      | 32 +++++++++
 10 files changed, 208 insertions(+), 15 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 136a1f1d294..a28678983e3 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [synopsis]
-git backfill [--min-batch-size=]
+git backfill [--min-batch-size=] [--[no-]sparse]
 
 DESCRIPTION
 -----------
@@ -57,6 +57,10 @@ OPTIONS
 	blobs seen at a given path. The default minimum batch size is
 	50,000.
 
+`--[no-]sparse`::
+	Only download objects if they appear at a path that matches the
+	current sparse-checkout.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt index 1fba0ce04cb..3e089211fb4 100644 --- a/Documentation/technical/api-path-walk.txt +++ b/Documentation/technical/api-path-walk.txt @@ -56,6 +56,14 @@ better off using the revision walk API instead. the revision walk so that the walk emits commits marked with the `UNINTERESTING` flag. +`pl`:: + This pattern list pointer allows focusing the path-walk search to + a set of patterns, only emitting paths that match the given + patterns. See linkgit:gitignore[5] or + linkgit:git-sparse-checkout[1] for details about pattern lists. + When the pattern list uses cone-mode patterns, then the path-walk + API can prune the set of paths it walks to improve performance. + Examples -------- diff --git a/builtin/backfill.c b/builtin/backfill.c index cfebee6e17b..d7b997fd6f7 100644 --- a/builtin/backfill.c +++ b/builtin/backfill.c @@ -4,6 +4,7 @@ #include "parse-options.h" #include "repository.h" #include "commit.h" +#include "dir.h" #include "hex.h" #include "tree.h" #include "tree-walk.h" @@ -21,7 +22,7 @@ #include "path-walk.h" static const char * const builtin_backfill_usage[] = { - N_("git backfill [--min-batch-size=]"), + N_("git backfill [--min-batch-size=] [--[no-]sparse]"), NULL }; @@ -29,6 +30,7 @@ struct backfill_context { struct repository *repo; struct oid_array current_batch; size_t min_batch_size; + int sparse; }; static void backfill_context_clear(struct backfill_context *ctx) @@ -78,6 +80,14 @@ static int do_backfill(struct backfill_context *ctx) struct path_walk_info info = PATH_WALK_INFO_INIT; int ret; + if (ctx->sparse) { + CALLOC_ARRAY(info.pl, 1); + if (get_sparse_checkout_patterns(info.pl)) { + path_walk_info_clear(&info); + return error(_("problem loading sparse-checkout")); + } + } + repo_init_revisions(ctx->repo, &revs, ""); handle_revision_arg("HEAD", &revs, 0, 0); @@ -106,10 +116,13 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, 
struct reposit .repo = repo, .current_batch = OID_ARRAY_INIT, .min_batch_size = 50000, + .sparse = 0, }; struct option options[] = { OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size, N_("Minimum number of objects to request at a time")), + OPT_BOOL(0, "sparse", &ctx.sparse, + N_("Restrict the missing objects to the current sparse-checkout")), OPT_END(), }; diff --git a/dir.c b/dir.c index 5b2181e5899..16ccfe7e4e8 100644 --- a/dir.c +++ b/dir.c @@ -1093,10 +1093,6 @@ static void invalidate_directory(struct untracked_cache *uc, dir->dirs[i]->recurse = 0; } -static int add_patterns_from_buffer(char *buf, size_t size, - const char *base, int baselen, - struct pattern_list *pl); - /* Flags for add_patterns() */ #define PATTERN_NOFOLLOW (1<<0) @@ -1186,9 +1182,9 @@ static int add_patterns(const char *fname, const char *base, int baselen, return 0; } -static int add_patterns_from_buffer(char *buf, size_t size, - const char *base, int baselen, - struct pattern_list *pl) +int add_patterns_from_buffer(char *buf, size_t size, + const char *base, int baselen, + struct pattern_list *pl) { char *orig = buf; int i, lineno = 1; diff --git a/dir.h b/dir.h index a3a2f00f5d9..6cfef5df660 100644 --- a/dir.h +++ b/dir.h @@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname); int add_patterns_from_blob_to_list(struct object_id *oid, const char *base, int baselen, struct pattern_list *pl); +int add_patterns_from_buffer(char *buf, size_t size, + const char *base, int baselen, + struct pattern_list *pl); void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen); void add_pattern(const char *string, const char *base, int baselen, struct pattern_list *pl, int srcpos); diff --git a/path-walk.c b/path-walk.c index 9715a5550ef..341bdd2ba4e 100644 --- a/path-walk.c +++ b/path-walk.c @@ -12,6 +12,7 @@ #include "object.h" #include "oid-array.h" #include "prio-queue.h" +#include "repository.h" #include "revision.h" 
#include "string-list.h" #include "strmap.h" @@ -172,6 +173,23 @@ static int add_tree_entries(struct path_walk_context *ctx, if (type == OBJ_TREE) strbuf_addch(&path, '/'); + if (ctx->info->pl) { + int dtype; + enum pattern_match_result match; + match = path_matches_pattern_list(path.buf, path.len, + path.buf + base_len, &dtype, + ctx->info->pl, + ctx->repo->index); + + if (ctx->info->pl->use_cone_patterns && + match == NOT_MATCHED) + continue; + else if (!ctx->info->pl->use_cone_patterns && + type == OBJ_BLOB && + match != MATCHED) + continue; + } + if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) { CALLOC_ARRAY(list, 1); list->type = type; @@ -582,10 +600,10 @@ void path_walk_info_init(struct path_walk_info *info) memcpy(info, &empty, sizeof(empty)); } -void path_walk_info_clear(struct path_walk_info *info UNUSED) +void path_walk_info_clear(struct path_walk_info *info) { - /* - * This destructor is empty for now, as info->revs - * is not owned by 'struct path_walk_info'. - */ + if (info->pl) { + clear_pattern_list(info->pl); + free(info->pl); + } } diff --git a/path-walk.h b/path-walk.h index 414d6db23c2..473ee9d361c 100644 --- a/path-walk.h +++ b/path-walk.h @@ -6,6 +6,7 @@ struct rev_info; struct oid_array; +struct pattern_list; /** * The type of a function pointer for the method that is called on a list of @@ -48,6 +49,16 @@ struct path_walk_info { * walk the children of such trees. */ int prune_all_uninteresting; + + /** + * Specify a sparse-checkout definition to match our paths to. Do not + * walk outside of this sparse definition. If the patterns are in + * cone mode, then the search may prune directories that are outside + * of the cone. If not in cone mode, then all tree paths will be + * explored but the path_fn will only be called when the path matches + * the sparse-checkout patterns. 
+ */ + struct pattern_list *pl; }; #define PATH_WALK_INFO_INIT { \ diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c index 7f2d409c5bc..61e845e5ec2 100644 --- a/t/helper/test-path-walk.c +++ b/t/helper/test-path-walk.c @@ -1,6 +1,7 @@ #define USE_THE_REPOSITORY_VARIABLE #include "test-tool.h" +#include "dir.h" #include "environment.h" #include "hex.h" #include "object-name.h" @@ -9,6 +10,7 @@ #include "revision.h" #include "setup.h" #include "parse-options.h" +#include "strbuf.h" #include "path-walk.h" #include "oid-array.h" @@ -65,7 +67,7 @@ static int emit_block(const char *path, struct oid_array *oids, int cmd__path_walk(int argc, const char **argv) { - int res; + int res, stdin_pl = 0; struct rev_info revs = REV_INFO_INIT; struct path_walk_info info = PATH_WALK_INFO_INIT; struct path_walk_test_data data = { 0 }; @@ -80,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv) N_("toggle inclusion of tree objects")), OPT_BOOL(0, "prune", &info.prune_all_uninteresting, N_("toggle pruning of uninteresting paths")), + OPT_BOOL(0, "stdin-pl", &stdin_pl, + N_("read a pattern list over stdin")), OPT_END(), }; @@ -99,6 +103,17 @@ int cmd__path_walk(int argc, const char **argv) info.path_fn = emit_block; info.path_fn_data = &data; + if (stdin_pl) { + struct strbuf in = STRBUF_INIT; + CALLOC_ARRAY(info.pl, 1); + + info.pl->use_cone_patterns = 1; + + strbuf_fread(&in, 2048, stdin); + add_patterns_from_buffer(in.buf, in.len, "", 0, info.pl); + strbuf_release(&in); + } + res = walk_objects_by_path(&info); printf("commits:%" PRIuMAX "\n" @@ -107,6 +122,11 @@ int cmd__path_walk(int argc, const char **argv) "tags:%" PRIuMAX "\n", data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr); + if (info.pl) { + clear_pattern_list(info.pl); + free(info.pl); + } + release_revisions(&revs); return res; } diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh index 36107a51c54..6b72e9d0e31 100755 --- a/t/t5620-backfill.sh +++ b/t/t5620-backfill.sh @@ -77,6 +77,94 @@ 
test_expect_success 'do partial clone 2, backfill min batch size' ' test_line_count = 0 revs2 ' +test_expect_success 'backfill --sparse' ' + git clone --sparse --filter=blob:none \ + --single-branch --branch=main \ + "file://$(pwd)/srv.bare" backfill3 && + + # Initial checkout includes four files at root. + git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing && + test_line_count = 44 missing && + + # Initial sparse-checkout is just the files at root, so we get the + # older versions of the four files at tip. + GIT_TRACE2_EVENT="$(pwd)/sparse-trace1" git \ + -C backfill3 backfill --sparse && + test_trace2_data promisor fetch_count 4 missing && + test_line_count = 40 missing && + + # Expand the sparse-checkout to include 'd' recursively. This + # engages the algorithm to skip the trees for 'a'. Note that + # the "sparse-checkout set" command downloads the objects at tip + # to satisfy the current checkout. + git -C backfill3 sparse-checkout set d && + GIT_TRACE2_EVENT="$(pwd)/sparse-trace2" git \ + -C backfill3 backfill --sparse && + test_trace2_data promisor fetch_count 8 missing && + test_line_count = 24 missing +' + +test_expect_success 'backfill --sparse without cone mode (positive)' ' + git clone --no-checkout --filter=blob:none \ + --single-branch --branch=main \ + "file://$(pwd)/srv.bare" backfill4 && + + # No blobs yet + git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing && + test_line_count = 48 missing && + + # Define sparse-checkout by filename regardless of parent directory. + # This downloads 6 blobs to satisfy the checkout. 
+ git -C backfill4 sparse-checkout set --no-cone "**/file.1.txt" && + git -C backfill4 checkout main && + + # Track new blob count + git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing && + test_line_count = 42 missing && + + GIT_TRACE2_EVENT="$(pwd)/no-cone-trace1" git \ + -C backfill4 backfill --sparse && + test_trace2_data promisor fetch_count 6 missing && + test_line_count = 36 missing +' + +test_expect_success 'backfill --sparse without cone mode (negative)' ' + git clone --no-checkout --filter=blob:none \ + --single-branch --branch=main \ + "file://$(pwd)/srv.bare" backfill5 && + + # No blobs yet + git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing && + test_line_count = 48 missing && + + # Define sparse-checkout by filename regardless of parent directory. + # This downloads 18 blobs to satisfy the checkout + git -C backfill5 sparse-checkout set --no-cone "**/file*" "!**/file.1.txt" && + git -C backfill5 checkout main && + + # Track new blob count + git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing && + test_line_count = 30 missing && + + GIT_TRACE2_EVENT="$(pwd)/no-cone-trace2" git \ + -C backfill5 backfill --sparse && + test_trace2_data promisor fetch_count 18 missing && + test_line_count = 12 missing +' + . 
"$TEST_DIRECTORY"/lib-httpd.sh start_httpd diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh index 5f04acb8a2f..c89b0f1e19d 100755 --- a/t/t6601-path-walk.sh +++ b/t/t6601-path-walk.sh @@ -176,6 +176,38 @@ test_expect_success 'branches and indexed objects mix well' ' test_cmp_sorted expect out ' +test_expect_success 'base & topic, sparse' ' + cat >patterns <<-EOF && + /* + !/*/ + /left/ + EOF + + test-tool path-walk --stdin-pl -- base topic out && + + cat >expect <<-EOF && + 0:commit::$(git rev-parse topic) + 0:commit::$(git rev-parse base) + 0:commit::$(git rev-parse base~1) + 0:commit::$(git rev-parse base~2) + 1:tree::$(git rev-parse topic^{tree}) + 1:tree::$(git rev-parse base^{tree}) + 1:tree::$(git rev-parse base~1^{tree}) + 1:tree::$(git rev-parse base~2^{tree}) + 2:blob:a:$(git rev-parse base~2:a) + 3:tree:left/:$(git rev-parse base:left) + 3:tree:left/:$(git rev-parse base~2:left) + 4:blob:left/b:$(git rev-parse base~2:left/b) + 4:blob:left/b:$(git rev-parse base:left/b) + blobs:3 + commits:4 + tags:0 + trees:6 + EOF + + test_cmp_sorted expect out +' + test_expect_success 'topic only' ' test-tool path-walk -- topic >out && From patchwork Mon Feb 3 17:11:07 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Derrick Stolee X-Patchwork-Id: 13957884 Received: from mail-ej1-f46.google.com (mail-ej1-f46.google.com [209.85.218.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B02CF20E010 for ; Mon, 3 Feb 2025 17:11:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.218.46 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738602679; cv=none; b=XcTzDVq/hLKGlf0vyO/7f0W/ertzapInIryld00E6xMUcyFwykXTHZLKSBg4pnP2wBQPr8K1MocZMyQWPelt9TaTG3mhVRzXoVOMPd4dgD2m1AhDbYUlhtmbBIabL0rbvp+GfOpDTQljeWwyF0C2lbZWStIXd8UfSkWXBecYn/Y= 
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738602679; c=relaxed/simple; bh=sK5r3geJ32RRKyiwX6IU/j98CfcS15SplYsApNVP6jM=; h=Message-Id:In-Reply-To:References:From:Date:Subject:Content-Type: MIME-Version:To:Cc; b=pY8AJ7rR8HSoZhLahXAuLuqaIFIqfSVyRNED7oCZm6lbp2p+aolpyMhWgs9tuxcHufbIgn090zKuo4UNa7A1NJhLGCdMZYLqUIWugUJFzoP4u0/qgdfBSmGmt6KFwV0mJsU+R0fHp6gmqUBai7rVxFACLeNcJRzBWu8m9Ksu7hc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=VSidotxP; arc=none smtp.client-ip=209.85.218.46 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="VSidotxP" Received: by mail-ej1-f46.google.com with SMTP id a640c23a62f3a-aafc9d75f8bso932466566b.2 for ; Mon, 03 Feb 2025 09:11:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1738602675; x=1739207475; darn=vger.kernel.org; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=bCSRk67UWubQZv8ZkD5P0kmmXo+HTrEOf0+xdgcudkc=; b=VSidotxPSaanHgeVWFa3OXPWODSNhmasbYlfTfw0lmqkWyu6SjidawiGVu+M8YlI8x J3eZ6KDcnnkMJMOKeiiZpOjPy6VFq74ANOKt9AEj8LL/1z4ZyxXbML0iSwAImcljH+CF 9WR47fwT7ffUkOXgqLNofYAu9CowSr2c/LLC17eia7rWvK/nn8ZJIICk2gN3jB/pXHB7 Eg3YC+jRYUVWX4a2Ay8ZnZv5qI0aHwXuQ66fcSETTFLOdyj355eXE8v0RwwTmNwEo7WZ 4RcJ8OvBK/dn+9lUAmqktz1AQ+H11xAz99+/ADRRFW2NP8tf2shJOZtW8PoFtQDxrLiL 4DgA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738602675; x=1739207475; 
h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=bCSRk67UWubQZv8ZkD5P0kmmXo+HTrEOf0+xdgcudkc=; b=xAxrggWigJtIiSY7+vjzsbYJSKTJ0wtKEuvlkoGz0tCPmJOzJA4mqIym4jKi/DFvGe b1PPQivfhVNgM9w9xxO4NKP9RQAX5OQEnBeaCHOwoTzbctDUjpjysnbGNu1WSYyGVw5U YNLemSrydSwKtVUMidPG1Eds3oGjera+CXtsO5+jTsjUT7BRa8VWPe4uejKoVPLRjLvt LMALsbniKA6aiRhg7pOm60vf8t2LuaA+M0CpgPlrdPlp8EyEdRG9EHf/kPLjfzoQrDYk Gqx1dnfdJ6h3I5lsLqmz5jAA7RYjJiVzggawtSkVn2lYbqrDUtcqqcSBrbquA+s+LRuF g5lg== X-Gm-Message-State: AOJu0YxRTsv241FT6UxqKph+DlLT/imnASL2gYWuYG/DazUn1XODD1rh kle23iiQN8QHG02vusfq8LVE2RwuHDRCXDlqe5nWZLUXlmesJ8X8IL/Tcw== X-Gm-Gg: ASbGncumTvhYtnU5LriQXfkUa9/PE4tKU+kBph5zG/IAGc7RUB3wdGZESHgMh5CcK4/ Yr5uOiV/Xg0BL3WfQtwMHYNT9p+qCt6IdfOysLjwayQnpXN+M2VTAldyNG1jdXdphYyEs2JkN/w GdTfkKSq5t6hVkG3BVMgccZydqT0OeiCbXureuq4eToV9lz3oqNUGxyhX7Nqvcqi37oPXUXrRiR eiIqjS/+mCoAcvrnrMIEPicqoB2vzLTKHk6d/L/hguxbQjKlbm3a61LAV2c3uYzT0ZpQU8a6ep8 EyEF1mTEBtffRj/E X-Google-Smtp-Source: AGHT+IERynYy3Q3g9FTP2VFjxEBR00frOPxT8VraYWKQdpPxO8SlDBFaERhMN52rtEbCxPqEUiIFXg== X-Received: by 2002:a17:907:9455:b0:ab6:fd25:3c72 with SMTP id a640c23a62f3a-ab6fd253f3amr1926829466b.10.1738602674437; Mon, 03 Feb 2025 09:11:14 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-ab6e47a7fd9sm782252266b.34.2025.02.03.09.11.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 03 Feb 2025 09:11:14 -0800 (PST) Message-Id: In-Reply-To: References: Date: Mon, 03 Feb 2025 17:11:07 +0000 Subject: [PATCH v3 5/5] backfill: assume --sparse when sparse-checkout is enabled Fcc: Sent Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, ps@pks.im, me@ttaylorr.com, johncai86@gmail.com, newren@gmail.com, christian.couder@gmail.com, 
    kristofferhaugsbakk@fastmail.com, jonathantanmy@google.com,
    karthik.188@gmail.com, Jean-Noël AVILA, Derrick Stolee, Derrick Stolee
From: Derrick Stolee

From: Derrick Stolee

The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be
reviewed independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without
sparse-checkout to make it clear that supplying it without a
sparse-checkout is an error.

Signed-off-by: Derrick Stolee
---
 Documentation/git-backfill.txt |  3 ++-
 builtin/backfill.c             |  7 +++++++
 t/t5620-backfill.sh            | 13 ++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index a28678983e3..95623051f78 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -59,7 +59,8 @@ OPTIONS
 
 `--[no-]sparse`::
 	Only download objects if they appear at a path that matches the
-	current sparse-checkout.
+	current sparse-checkout. If the sparse-checkout feature is enabled,
+	then `--sparse` is assumed and can be disabled with `--no-sparse`.
SEE ALSO -------- diff --git a/builtin/backfill.c b/builtin/backfill.c index d7b997fd6f7..d7ee84692f3 100644 --- a/builtin/backfill.c +++ b/builtin/backfill.c @@ -1,3 +1,6 @@ +/* We need this macro to access core_apply_sparse_checkout */ +#define USE_THE_REPOSITORY_VARIABLE + #include "builtin.h" #include "git-compat-util.h" #include "config.h" @@ -5,6 +8,7 @@ #include "repository.h" #include "commit.h" #include "dir.h" +#include "environment.h" #include "hex.h" #include "tree.h" #include "tree-walk.h" @@ -133,6 +137,9 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit repo_config(repo, git_default_config, NULL); + if (ctx.sparse < 0) + ctx.sparse = core_apply_sparse_checkout; + result = do_backfill(&ctx); backfill_context_clear(&ctx); return result; diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh index 6b72e9d0e31..58c81556e72 100755 --- a/t/t5620-backfill.sh +++ b/t/t5620-backfill.sh @@ -77,6 +77,12 @@ test_expect_success 'do partial clone 2, backfill min batch size' ' test_line_count = 0 revs2 ' +test_expect_success 'backfill --sparse without sparse-checkout fails' ' + git init not-sparse && + test_must_fail git -C not-sparse backfill --sparse 2>err && + grep "problem loading sparse-checkout" err +' + test_expect_success 'backfill --sparse' ' git clone --sparse --filter=blob:none \ --single-branch --branch=main \ @@ -105,7 +111,12 @@ test_expect_success 'backfill --sparse' ' test_trace2_data promisor fetch_count 8 missing && - test_line_count = 24 missing + test_line_count = 24 missing && + + # Disabling the --sparse option (on by default) will download everything + git -C backfill3 backfill --no-sparse && + git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing && + test_line_count = 0 missing ' test_expect_success 'backfill --sparse without cone mode (positive)' '