From patchwork Tue Mar 29 13:56:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12794880 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 703E4C433FE for ; Tue, 29 Mar 2022 13:56:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237623AbiC2N6L (ORCPT ); Tue, 29 Mar 2022 09:58:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41506 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237604AbiC2N6F (ORCPT ); Tue, 29 Mar 2022 09:58:05 -0400 Received: from mail-ej1-x62d.google.com (mail-ej1-x62d.google.com [IPv6:2a00:1450:4864:20::62d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 15CAE22385E for ; Tue, 29 Mar 2022 06:56:22 -0700 (PDT) Received: by mail-ej1-x62d.google.com with SMTP id j15so35293773eje.9 for ; Tue, 29 Mar 2022 06:56:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=I1ZPHYKIqq6Ky7VafODvgg0vnAEtT/D/0hxP3toVZ80=; b=lac8Yu2U8xgiLOgAVSEo72hjrEYs+uMKNTBNR7LOfSFtfDKFKVpXRIIR7dpxbyC40S ioXBxTf75bp5WoGMjsq12CoVOMsQDZQ9Kax0d0X0bg+nyhE5/lzMDuB0URDYPtgulNgF j4R/6uiU9xbhCiUNJLIc3BHwcLv9OHSiHlMUcF/gxysy164N0/1HcJF+U2M2YqiZslfb K69tLpvBUxSTZbgiw4ymHJOdV+AfP9xj1FjZmk7g3b8MkA2DxDvnjQUN/CEP47bHoSKL P01eg7a2owvUxSWNQ1LfTny7GwnA/whYAwekJzxOaoB4ELNdKrXvdPkhPCOID/SKcnvg p2OQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; 
bh=I1ZPHYKIqq6Ky7VafODvgg0vnAEtT/D/0hxP3toVZ80=; b=hF0SbxNxnhAsSIeERd2+xLGi4pL+VVo2UHN2V74H/GzN6pLnR2r8T2oiYs+TLg3ti/ XyMNLW1hYQn5Ox3k89I7be6k6APzJc751WtkxFFf3FcXCDUyTRKKsLUjKfAjAAN5D6HV 2JaK54oQzQB41eoWKEMqh3oHuvpngmoslkqckH1uFC7QQ5/oXOuD4EFFFpAD4Dv0kZ4P Yh5WBs2i02oR7BVhGu9QnUIhL/oz11v1SRi/qF8dLfr76Tqm2RjdtUqv+YeA8FQjtHYw +CCtg7uFdFef/KtI8da8pRKv8e8PxQt8TC+Hj/okbajPij31sTwVHbdlTnpwQzxwN/R4 dgZA== X-Gm-Message-State: AOAM5325D5O0l0/seJU8EfGSaizIA6PPLswu0QyEeFamXDDp1aim5mga 4i90nUPrOpHrodOkkrrFY94InUmtfH8RKg== X-Google-Smtp-Source: ABdhPJycR6/o0QFWVnHvlp6YrjGGtU8uXhRTZnWidgUYm+GOhtMGKqX7QJDMtbcNjpkWJgUIcqU0cg== X-Received: by 2002:a17:906:9b85:b0:6db:ab80:7924 with SMTP id dd5-20020a1709069b8500b006dbab807924mr34523309ejc.160.1648562179444; Tue, 29 Mar 2022 06:56:19 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. [2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:18 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , Han Xin , Jiang Xin , =?utf-8?q?Ren=C3=A9_Scharfe?= , Derrick Stolee , Philip Oakley , Neeraj Singh , Elijah Newren , Han Xin , Jiang Xin , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v12 1/8] unpack-objects: low memory footprint for get_data() in dry_run mode Date: Tue, 29 Mar 2022 15:56:06 +0200 Message-Id: X-Mailer: git-send-email 2.35.1.1548.g36973b18e52 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin As the name implies, "get_data(size)" will allocate and return a given amount of memory. Allocating memory for a large blob object may cause the system to run out of memory. Before replacing the calls to "get_data()" that unpack large blob objects in later commits, refactor "get_data()" to reduce its memory footprint in dry_run mode. 
Because in dry_run mode, "get_data()" is only used to check the integrity of data, and the returned buffer is not used at all, we can allocate a smaller buffer and reuse it as zstream output. Therefore, in dry_run mode, "get_data()" will release the allocated buffer and return NULL instead of returning garbage data. The "find [...]objects/?? -type f | wc -l" test idiom being used here is adapted from the same "find" use added to another test in d9545c7f465 (fast-import: implement unpack limit, 2016-04-25). Suggested-by: Jiang Xin Signed-off-by: Han Xin Signed-off-by: Ævar Arnfjörð Bjarmason --- builtin/unpack-objects.c | 34 ++++++++++++++++++--------- t/t5351-unpack-large-objects.sh | 41 +++++++++++++++++++++++++++++++++ 2 files changed, 64 insertions(+), 11 deletions(-) create mode 100755 t/t5351-unpack-large-objects.sh diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index dbeb0680a58..e3d30025979 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -96,15 +96,26 @@ static void use(int bytes) display_throughput(progress, consumed_bytes); } +/* + * Decompress zstream from stdin and return specific size of data. + * The caller is responsible to free the returned buffer. + * + * But for dry_run mode, "get_data()" is only used to check the + * integrity of data, and the returned buffer is not used at all. + * Therefore, in dry_run mode, "get_data()" will release the small + * allocated buffer which is reused to hold temporary zstream output + * and return NULL instead of returning garbage data. + */ static void *get_data(unsigned long size) { git_zstream stream; - void *buf = xmallocz(size); + unsigned long bufsize = dry_run && size > 8192 ? 
8192 : size; + void *buf = xmallocz(bufsize); memset(&stream, 0, sizeof(stream)); stream.next_out = buf; - stream.avail_out = size; + stream.avail_out = bufsize; stream.next_in = fill(1); stream.avail_in = len; git_inflate_init(&stream); @@ -124,8 +135,15 @@ static void *get_data(unsigned long size) } stream.next_in = fill(1); stream.avail_in = len; + if (dry_run) { + /* reuse the buffer in dry_run mode */ + stream.next_out = buf; + stream.avail_out = bufsize; + } } git_inflate_end(&stream); + if (dry_run) + FREE_AND_NULL(buf); return buf; } @@ -325,10 +343,8 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size, { void *buf = get_data(size); - if (!dry_run && buf) + if (buf) write_object(nr, type, buf, size); - else - free(buf); } static int resolve_against_held(unsigned nr, const struct object_id *base, @@ -358,10 +374,8 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, oidread(&base_oid, fill(the_hash_algo->rawsz)); use(the_hash_algo->rawsz); delta_data = get_data(delta_size); - if (dry_run || !delta_data) { - free(delta_data); + if (!delta_data) return; - } if (has_object_file(&base_oid)) ; /* Ok we have this one */ else if (resolve_against_held(nr, &base_oid, @@ -397,10 +411,8 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, die("offset value out of bound for delta base object"); delta_data = get_data(delta_size); - if (dry_run || !delta_data) { - free(delta_data); + if (!delta_data) return; - } lo = 0; hi = nr; while (lo < hi) { diff --git a/t/t5351-unpack-large-objects.sh b/t/t5351-unpack-large-objects.sh new file mode 100755 index 00000000000..8d84313221c --- /dev/null +++ b/t/t5351-unpack-large-objects.sh @@ -0,0 +1,41 @@ +#!/bin/sh +# +# Copyright (c) 2022 Han Xin +# + +test_description='git unpack-objects with large objects' + +. 
./test-lib.sh + +prepare_dest () { + test_when_finished "rm -rf dest.git" && + git init --bare dest.git +} + +test_expect_success "create large objects (1.5 MB) and PACK" ' + test-tool genrandom foo 1500000 >big-blob && + test_commit --append foo big-blob && + test-tool genrandom bar 1500000 >big-blob && + test_commit --append bar big-blob && + PACK=$(echo HEAD | git pack-objects --revs pack) +' + +test_expect_success 'set memory limitation to 1MB' ' + GIT_ALLOC_LIMIT=1m && + export GIT_ALLOC_LIMIT +' + +test_expect_success 'unpack-objects failed under memory limitation' ' + prepare_dest && + test_must_fail git -C dest.git unpack-objects err && + grep "fatal: attempting to allocate" err +' + +test_expect_success 'unpack-objects works with memory limitation in dry-run mode' ' + prepare_dest && + git -C dest.git unpack-objects -n X-Patchwork-Id: 12794879 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 759DAC433EF for ; Tue, 29 Mar 2022 13:56:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237621AbiC2N6K (ORCPT ); Tue, 29 Mar 2022 09:58:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41516 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237605AbiC2N6F (ORCPT ); Tue, 29 Mar 2022 09:58:05 -0400 Received: from mail-ej1-x634.google.com (mail-ej1-x634.google.com [IPv6:2a00:1450:4864:20::634]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2F5A522385F for ; Tue, 29 Mar 2022 06:56:22 -0700 (PDT) Received: by mail-ej1-x634.google.com with SMTP id yy13so35352093ejb.2 for ; Tue, 29 Mar 2022 06:56:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; 
bh=GWY/unzhQkolwOpDjUfMMFSQ3k7/dBT5XnZDr53TTu8=; b=IfzH0UkMtb8fHEPRyrmojfnz+hF7DkNaqszXU7STf//PUEZmFkjwaxr2JNBbZykFA/ HjbIKDsRphdhSwoDZMjhN2G8oBWkU+pQf9MiedT/tO0ex+dPXNG8A8iC20SgvWXEsAvS PNJ5FR0t+pXMi/5/WQDc8rA8w8umlvYOY+PdJrMkWdZGlvheSm1kr+79T8pwfq4fFJpE fSCIRBStcP6fRr8sSZ+vJsX1CH4R4um8dhlFg1cN+ROQOiS6UFSnsF94OktmGWnNrX3r vsmPvjtVtf9trIBnQfBKuJ7Fb9wY6iTjELPqYOFEwnKkn6vtypeCB2bZcjDAI9h9uRNP 2+DQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=GWY/unzhQkolwOpDjUfMMFSQ3k7/dBT5XnZDr53TTu8=; b=mEHUM6bQ82M0sYJ4T4Enl3jb5UazMJabjOj6ZB0fGJDdBAXj6fIlom3TecjTyAXprd toEu/D14j9MewIbHzi7AirVXJ3UkU3cUYasnVXDGKAjVaYeXtesHIiAyvQ09klGSnkGr GiJHFhIkqqmDpI+pVzRydFtO/pBvgKghkBzqiRC9tqUcGakkCJF54Ers5MrncJplLwT/ Be1FJlSDYeNe383gY0HgvZwhCFYy9e/jMgN+cDT2X4XbziB87IVguAl8LpKc0nxg57LW wodoPSbq8HKoGJltQ/E6K20PNEzv+2SeRf0L92xozo7jSPzKTRBEpueWFGr36eQ95l5N ZMXA== X-Gm-Message-State: AOAM5330hjmsC6xBVY6QP7BLJTlTw/1UImvlVhxfi/F1FDQkfeF8qfC+ OYL9lbiEyhif/SILmbUf8SjDGAtFsiax2Q== X-Google-Smtp-Source: ABdhPJyM1TKjd9JMUqt//1GI7lrNEQypGJAgZzIveIURrs1X+yhyVuY45eig3ddrcEZZNhRvE203jA== X-Received: by 2002:a17:907:7ea5:b0:6e1:13c3:e35f with SMTP id qb37-20020a1709077ea500b006e113c3e35fmr10575266ejc.99.1648562180261; Tue, 29 Mar 2022 06:56:20 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. 
[2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:19 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , Han Xin , Jiang Xin , =?utf-8?q?Ren=C3=A9_Scharfe?= , Derrick Stolee , Philip Oakley , Neeraj Singh , Elijah Newren , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v12 2/8] object-file.c: do fsync() and close() before post-write die() Date: Tue, 29 Mar 2022 15:56:07 +0200 Message-Id: X-Mailer: git-send-email 2.35.1.1548.g36973b18e52 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Change write_loose_object() to do an fsync() and close() before the oideq() sanity check at the end. This change re-joins code that was split up by the die() sanity check added in 748af44c63e (sha1_file: be paranoid when creating loose objects, 2010-02-21). I don't think that this change matters in itself: if we called die() it was possible that our data wouldn't fully make it to disk, but in any case we were writing data that we'd consider corrupted. It's possible that a subsequent "git fsck" will be less confused now. The real reason to make this change is that in a subsequent commit we'll split this code in write_loose_object() into a utility function. All its callers will want the preceding sanity checks, but not the "oideq" check. By moving the close_loose_object() call earlier it'll be easier to reason about the introduction of the utility function. 
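The ordering described above can be sketched outside of git as follows. This is only an illustration under stated assumptions: it uses POSIX fsync() and close() directly, and finalize_then_check() and demo() are hypothetical names, not functions from object-file.c. The point it shows is that the descriptor is flushed and closed before the comparison that may make the caller give up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of git: flush and close "fd" first,
 * then compare the IDs. Returns 0 when they match, -1 when they
 * differ or when fsync()/close() fails.
 */
static int finalize_then_check(int fd, const char *expected_id,
			       const char *computed_id)
{
	int failed = 0;

	assert(expected_id && computed_id);

	/* Flush outstanding writes and release the descriptor first. */
	if (fsync(fd) < 0)
		failed = 1;
	if (close(fd) < 0)
		failed = 1;
	if (failed)
		return -1;

	/* Only now run the check that may make the caller give up. */
	return strcmp(expected_id, computed_id) ? -1 : 0;
}

/* Usage sketch: write a few bytes to a temporary file, then finalize it. */
static int demo(const char *expected_id, const char *computed_id)
{
	char path[] = "/tmp/finalize-sketch-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -2;
	unlink(path); /* the open descriptor keeps the file usable */
	if (write(fd, "data", 4) != 4) {
		close(fd);
		return -2;
	}
	return finalize_then_check(fd, expected_id, computed_id);
}
```

In this sketch a mismatch is reported with a return value so the caller can decide what to do; the patch itself keeps calling die() at that point.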
Signed-off-by: Ævar Arnfjörð Bjarmason --- object-file.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/object-file.c b/object-file.c index 62ebe236c90..5da458eccbf 100644 --- a/object-file.c +++ b/object-file.c @@ -1886,7 +1886,14 @@ void hash_object_file(const struct git_hash_algo *algo, const void *buf, hash_object_file_literally(algo, buf, len, type_name(type), oid); } -/* Finalize a file on disk, and close it. */ +/* + * We already did a write_buffer() to the "fd", let's fsync() + * and close(). + * + * Finalize a file on disk, and close it. We might still die() on a + * subsequent sanity check, but let's not add to that confusion by not + * flushing any outstanding writes to disk first. + */ static void close_loose_object(int fd) { if (the_repository->objects->odb->will_destroy) @@ -2006,12 +2013,12 @@ static int write_loose_object(const struct object_id *oid, char *hdr, die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid), ret); the_hash_algo->final_oid_fn(&parano_oid, &c); + close_loose_object(fd); + if (!oideq(oid, &parano_oid)) die(_("confused by unstable object source data for %s"), oid_to_hex(oid)); - close_loose_object(fd); - if (mtime) { struct utimbuf utb; utb.actime = mtime; From patchwork Tue Mar 29 13:56:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12794881 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 40925C433F5 for ; Tue, 29 Mar 2022 13:56:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237626AbiC2N6M (ORCPT ); Tue, 29 Mar 2022 09:58:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41584 "EHLO lindbergh.monkeyblade.net" 
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237612AbiC2N6G (ORCPT ); Tue, 29 Mar 2022 09:58:06 -0400 Received: from mail-ej1-x634.google.com (mail-ej1-x634.google.com [IPv6:2a00:1450:4864:20::634]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 566BA223BE9 for ; Tue, 29 Mar 2022 06:56:23 -0700 (PDT) Received: by mail-ej1-x634.google.com with SMTP id o10so35364939ejd.1 for ; Tue, 29 Mar 2022 06:56:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=QIVuPPaIkwh901GDFQRzl6dW0h/Jnm3tQ4lUvqCLjHo=; b=jjOeoU6msPxtJdUI379bJwtiYR5K6MvRfx3SYxlpnuNKK3wINLgUHtXpOV+CjFz4UR XRGMj7NYEb6azdxWreRHAliJeGpXmMU1hYEkewOyuv9ztdSI0KM0v/sPswxPkxsMbcq1 iSj5uws4hB4smAaHXTPKMt2zTe2i9u6p4gmanAw37J4vIh4jCR1duaxJhITBsyXjcn9v xBP8rSor4jh+S3l75bM5W/4ZZMKtXIV/wycsMMr55XhsiJQ939IWAPLtZvHRaw2i13Fj 6wtk5fR+mzGMncFgDd6uc6orUY8k8REe9luUX76wBZgeE8F2lVIg+9G4+waqwK6cSl4p 7JDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=QIVuPPaIkwh901GDFQRzl6dW0h/Jnm3tQ4lUvqCLjHo=; b=KF+4KN7Xw81uFZMjXkoDFSD8G8cO9IMN/bhiZgRvGRrPoulZ/4J48OQaIbbxbnnvA5 LrU9tvAATi5LhGy4U5kJfk9ISPtCLJ25tke3gppenyAJrzH6+oEjFL6R6FCnv4ffDRKx 1QR6mTPZPM9Xt7PC3k6GO8HxK77zS4iwsnDv6d8u3ErBZPuRoSQp6uenZsE9GPTZU4tV skL+3EwwH77zvU8BkQRUD52xCmTe7K/5OGGS8PAMM3U2ZxtHnsxPIw8t9rQhbCTmF5II Z3ToG4G5KJw73rjIsXYYMC1fftLDV42z7VygiAwx+KCLVlKaoTHvwBKX3lvqKW4WfFTr dscw== X-Gm-Message-State: AOAM532kWefNr5GncJowqCvfaTVHbNIW47bIJw7huTVoQe9IL0BmPPxF RU9kP7NhjnGT+Bt4A3zbhCz5Wk7VIKZIFA== X-Google-Smtp-Source: ABdhPJy7fhT62sdyS2xM5rfSvFE6ardshErINB048mZUI4+N3uCdkTy0cYNm3CuEV7yjy0VVtvJX+g== X-Received: by 2002:a17:907:7255:b0:6df:e82b:3e89 with SMTP id ds21-20020a170907725500b006dfe82b3e89mr34537607ejc.493.1648562181516; Tue, 29 Mar 2022 
06:56:21 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. [2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:20 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , Han Xin , Jiang Xin , =?utf-8?q?Ren=C3=A9_Scharfe?= , Derrick Stolee , Philip Oakley , Neeraj Singh , Elijah Newren , Han Xin , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFy?= =?utf-8?b?bWFzb24=?= , Jiang Xin Subject: [PATCH v12 3/8] object-file.c: refactor write_loose_object() to several steps Date: Tue, 29 Mar 2022 15:56:08 +0200 Message-Id: X-Mailer: git-send-email 2.35.1.1548.g36973b18e52 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin When writing a large blob using "write_loose_object()", we have to pass a buffer with the whole content of the blob, which consumes a lot of memory and may cause OOM. We will introduce a streaming version of this function ("stream_loose_object()") in a later commit to resolve this issue. Before introducing that streaming function, do some refactoring on "write_loose_object()" so that both versions can reuse the same code. Rewrite "write_loose_object()" as follows: 1. Figure out a path for the (temp) object file. This step is only used in "write_loose_object()". 2. Move common steps for starting to write loose objects into a new function "start_loose_object_common()". 3. Compress data. 4. Move common steps for ending zlib stream into a new function "end_loose_object_common()". 5. Close fd and finalize the object file. 
Helped-by: Ævar Arnfjörð Bjarmason Helped-by: Jiang Xin Signed-off-by: Han Xin Signed-off-by: Ævar Arnfjörð Bjarmason --- object-file.c | 102 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 76 insertions(+), 26 deletions(-) diff --git a/object-file.c b/object-file.c index 5da458eccbf..7f160929e00 100644 --- a/object-file.c +++ b/object-file.c @@ -1955,6 +1955,75 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename) return fd; } +/** + * Common steps for loose object writers to start writing loose + * objects: + * + * - Create tmpfile for the loose object. + * - Setup zlib stream for compression. + * - Start to feed header to zlib stream. + * + * Returns a "fd", which should later be provided to + * end_loose_object_common(). + */ +static int start_loose_object_common(struct strbuf *tmp_file, + const char *filename, unsigned flags, + git_zstream *stream, + unsigned char *buf, size_t buflen, + git_hash_ctx *c, + char *hdr, int hdrlen) +{ + int fd; + + fd = create_tmpfile(tmp_file, filename); + if (fd < 0) { + if (flags & HASH_SILENT) + return -1; + else if (errno == EACCES) + return error(_("insufficient permission for adding " + "an object to repository database %s"), + get_object_directory()); + else + return error_errno( + _("unable to create temporary file")); + } + + /* Setup zlib stream for compression */ + git_deflate_init(stream, zlib_compression_level); + stream->next_out = buf; + stream->avail_out = buflen; + the_hash_algo->init_fn(c); + + /* Start to feed header to zlib stream */ + stream->next_in = (unsigned char *)hdr; + stream->avail_in = hdrlen; + while (git_deflate(stream, 0) == Z_OK) + ; /* nothing */ + the_hash_algo->update_fn(c, hdr, hdrlen); + + return fd; +} + +/** + * Common steps for loose object writers to end writing loose objects: + * + * - End the compression of zlib stream. + * - Get the calculated oid to "oid". 
+ * - fsync() and close() the "fd" + */ +static int end_loose_object_common(git_hash_ctx *c, git_zstream *stream, + struct object_id *oid) +{ + int ret; + + ret = git_deflate_end_gently(stream); + if (ret != Z_OK) + return ret; + the_hash_algo->final_oid_fn(oid, c); + + return Z_OK; +} + static int write_loose_object(const struct object_id *oid, char *hdr, int hdrlen, const void *buf, unsigned long len, time_t mtime, unsigned flags) @@ -1969,28 +2038,11 @@ static int write_loose_object(const struct object_id *oid, char *hdr, loose_object_path(the_repository, &filename, oid); - fd = create_tmpfile(&tmp_file, filename.buf); - if (fd < 0) { - if (flags & HASH_SILENT) - return -1; - else if (errno == EACCES) - return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory()); - else - return error_errno(_("unable to create temporary file")); - } - - /* Set it up */ - git_deflate_init(&stream, zlib_compression_level); - stream.next_out = compressed; - stream.avail_out = sizeof(compressed); - the_hash_algo->init_fn(&c); - - /* First header.. */ - stream.next_in = (unsigned char *)hdr; - stream.avail_in = hdrlen; - while (git_deflate(&stream, 0) == Z_OK) - ; /* nothing */ - the_hash_algo->update_fn(&c, hdr, hdrlen); + fd = start_loose_object_common(&tmp_file, filename.buf, flags, + &stream, compressed, sizeof(compressed), + &c, hdr, hdrlen); + if (fd < 0) + return -1; /* Then the data itself.. 
*/ stream.next_in = (void *)buf; @@ -2008,11 +2060,9 @@ static int write_loose_object(const struct object_id *oid, char *hdr, if (ret != Z_STREAM_END) die(_("unable to deflate new object %s (%d)"), oid_to_hex(oid), ret); - ret = git_deflate_end_gently(&stream); + ret = end_loose_object_common(&c, &stream, &parano_oid); if (ret != Z_OK) - die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid), - ret); - the_hash_algo->final_oid_fn(&parano_oid, &c); + die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid), ret); close_loose_object(fd); if (!oideq(oid, &parano_oid)) From patchwork Tue Mar 29 13:56:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12794882 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9595FC433F5 for ; Tue, 29 Mar 2022 13:56:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237628AbiC2N6O (ORCPT ); Tue, 29 Mar 2022 09:58:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41604 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237613AbiC2N6G (ORCPT ); Tue, 29 Mar 2022 09:58:06 -0400 Received: from mail-ed1-x530.google.com (mail-ed1-x530.google.com [IPv6:2a00:1450:4864:20::530]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 370B82241C8 for ; Tue, 29 Mar 2022 06:56:24 -0700 (PDT) Received: by mail-ed1-x530.google.com with SMTP id h4so13032580edr.3 for ; Tue, 29 Mar 2022 06:56:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=UVK5E1lv48xNvHwAyg6lZY+gFAjM+SJfzs7vh1Z6EIk=; 
b=JugYAzgmeol/rNigpCNPiKLK7D1xl4QuOQ5kzdE7RBB9cN/U8bVQSa3tnJiKRIxJov iim/F163DTUFClqSp6+2oku9hQrCf24E6YmD+LFCT7BMxi6vRmSC49T6nzAwCCbuZRX0 BiEmEkP8SEo1SOJiP3x9bCneSiJmlQ1y5VcFx8y+qoaWSTx9tEsd7FqAIO5ABnT/VMxY x/omuDvl4mmHV16UVCMbBWOCM3AftUvXAHF9agxeZO68+0Tk34nRpaXbKbkEmgqL4hxv zp8idB5czlG9GS3e53YKg/jzXQSjACTu0MNVFXYMyrKXfSt0WGgWVDIsa+ZMIk0On2p2 2w3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=UVK5E1lv48xNvHwAyg6lZY+gFAjM+SJfzs7vh1Z6EIk=; b=T/bsjBgMO4fPP1wo2FOwFTP36drSCOE1c+3jP9wmNfI5rQsLnWUlY+s8Tudci87IbW GGL/Fl//bkdF7E2x1l+GNWldhjapQ5oFQADXy/H9kkmfidqog9Fqc4ZpTwnl6bPZJl8C UmG9YKH0QxW3Xqg8KsSxFj/nskFvywRhCqx+WukMPADehtLQz2SulNgN9bLIsDioV8jG Um1BxfduQDTJRUUnb+OxB/qCb06y50gOmWhGpA7E+Nnw/lnvD3HpmgUA2wyxutF6jaLR 2w8YN4MHWHd60+nE+OsNzy1nSCHHeJ0sFhCz71iCb+ze84HGebgUh3Bg1rl9Rdn5uh9J YVMw== X-Gm-Message-State: AOAM533gaBlNNJ45u9wB9Lf76eI+NJ8FxjzHWCsRYlJeQA3s9q371HaZ djPEnoiAGTStI8AR287lL09ByEvhBpHJxA== X-Google-Smtp-Source: ABdhPJyTR845ZBxp1v+7Zstay8KYJ0V+WvWgZsrTlBPVSz3rUD1thotwxqT1aUVe+xkE3Qrr2yoFbA== X-Received: by 2002:a05:6402:d0e:b0:418:f011:275e with SMTP id eb14-20020a0564020d0e00b00418f011275emr4673014edb.323.1648562182537; Tue, 29 Mar 2022 06:56:22 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. 
[2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:21 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , Han Xin , Jiang Xin , =?utf-8?q?Ren=C3=A9_Scharfe?= , Derrick Stolee , Philip Oakley , Neeraj Singh , Elijah Newren , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v12 4/8] object-file.c: factor out deflate part of write_loose_object() Date: Tue, 29 Mar 2022 15:56:09 +0200 Message-Id: X-Mailer: git-send-email 2.35.1.1548.g36973b18e52 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Split out the part of write_loose_object() that deals with calling git_deflate() into a utility function, a subsequent commit will introduce another function that'll make use of it. Signed-off-by: Ævar Arnfjörð Bjarmason --- object-file.c | 31 +++++++++++++++++++++++++------ 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/object-file.c b/object-file.c index 7f160929e00..6e2f2264f8c 100644 --- a/object-file.c +++ b/object-file.c @@ -2004,6 +2004,28 @@ static int start_loose_object_common(struct strbuf *tmp_file, return fd; } +/** + * Common steps for the inner git_deflate() loop for writing loose + * objects. Returns what git_deflate() returns. + */ +static int write_loose_object_common(git_hash_ctx *c, + git_zstream *stream, const int flush, + unsigned char *in0, const int fd, + unsigned char *compressed, + const size_t compressed_len) +{ + int ret; + + ret = git_deflate(stream, flush ? 
Z_FINISH : 0); + the_hash_algo->update_fn(c, in0, stream->next_in - in0); + if (write_buffer(fd, compressed, stream->next_out - compressed) < 0) + die(_("unable to write loose object file")); + stream->next_out = compressed; + stream->avail_out = compressed_len; + + return ret; +} + /** * Common steps for loose object writers to end writing loose objects: * @@ -2049,12 +2071,9 @@ static int write_loose_object(const struct object_id *oid, char *hdr, stream.avail_in = len; do { unsigned char *in0 = stream.next_in; - ret = git_deflate(&stream, Z_FINISH); - the_hash_algo->update_fn(&c, in0, stream.next_in - in0); - if (write_buffer(fd, compressed, stream.next_out - compressed) < 0) - die(_("unable to write loose object file")); - stream.next_out = compressed; - stream.avail_out = sizeof(compressed); + + ret = write_loose_object_common(&c, &stream, 1, in0, fd, + compressed, sizeof(compressed)); } while (ret == Z_OK); if (ret != Z_STREAM_END) From patchwork Tue Mar 29 13:56:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12794883 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8B342C433F5 for ; Tue, 29 Mar 2022 13:56:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237638AbiC2N6X (ORCPT ); Tue, 29 Mar 2022 09:58:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41748 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237600AbiC2N6I (ORCPT ); Tue, 29 Mar 2022 09:58:08 -0400 Received: from mail-ed1-x52a.google.com (mail-ed1-x52a.google.com [IPv6:2a00:1450:4864:20::52a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8CB87223BE9 for ; Tue, 29 Mar 2022 
06:56:25 -0700 (PDT) Received: by mail-ed1-x52a.google.com with SMTP id b24so20783431edu.10 for ; Tue, 29 Mar 2022 06:56:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=lsk4iODBcOZix63B1bkHB+Y795PZHBXpXZbo0cLolCs=; b=kbM6LL58Mr/7AQQmnskZFVDMl2/54eLNZuW/+lhFDpyjEhj04Cby60Byl45NagI+5Z fgDjzJ3zhFtEX6/K1zjzTgpY/HAc94yc6b7DFmJhnJiB3Do0JPyT495VBERZnME9VAGo 0PgYncF1V0E7Tn1bvXsu/S+gdLiwRbL5E80FRpDt4u5KU2iSM25T/4yPbEhUSp55bW5C 4E8uzHjRZKvuzXMfRR8AQ6krNT9a/prWpJGnL2D77qPEd5xCvK1FAKRfaQSNNSnSIqhk NOSWzO3ZV1r2iJEVesHbZ2cJzDgwKI5MRNw9xV/4G9PBuQcoNurAXqpzCU1/t1IHI42v Sb/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=lsk4iODBcOZix63B1bkHB+Y795PZHBXpXZbo0cLolCs=; b=01R5tvv8JRdKp0MV+wi6x77Y0l/gfaQb6ziMq5XdSTDTctXDejFbRdQdKhJX1HEdZ6 pEAfHHxAcL1tJl38AMjALa11xdZhITp8j32DHkCrOymg7FzLMhP1vm5zswrxkoDga4rI Y44Pl8ty35jyQENCOP0UJGQsNy0Z11yayTaev8YjBt9NH7LcR+tyZYZMJlrieFCe+J4e GFfdbQakzccAKnqQ21VBq67MmaPJAlsXAeIMdOwepyLJBHBgswt0wXfaYRan6L4H9sHY sMXVVT7gx6cxHs1qaDD8xD+jMEgYBLtgX3LhstnCuaLTzGO2rxtQkyejMj2JlE4lkAEA h1Dg== X-Gm-Message-State: AOAM530fkvPiW7fj3Jv049nOdxbHcJ2iZ/ascTC9X11z065YaA37irvB Fy2QVmPfRF3HFtQknSuDy9rh7RhG0d6MLQ== X-Google-Smtp-Source: ABdhPJy2Ss6jJtj3ZZc0Fgx0xZtoaCVL1TsErXaXDiBAOAEDWuVOI3phRcg74VuWX0nyDEWp2YU3ow== X-Received: by 2002:a05:6402:1107:b0:416:439a:6a9e with SMTP id u7-20020a056402110700b00416439a6a9emr4654649edv.382.1648562183781; Tue, 29 Mar 2022 06:56:23 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. 
[2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:23 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , Han Xin , Jiang Xin , =?utf-8?q?Ren=C3=A9_Scharfe?= , Derrick Stolee , Philip Oakley , Neeraj Singh , Elijah Newren , Han Xin , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFy?= =?utf-8?b?bWFzb24=?= , Jiang Xin Subject: [PATCH v12 5/8] object-file.c: add "stream_loose_object()" to handle large object Date: Tue, 29 Mar 2022 15:56:10 +0200 Message-Id: X-Mailer: git-send-email 2.35.1.1548.g36973b18e52 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin If we want to unpack and write a loose object using "write_loose_object", we have to feed it a buffer the same size as the object, which will consume lots of memory and may cause OOM. This can be improved by feeding data to "stream_loose_object()" in a stream. Add a new function "stream_loose_object()", which is a streaming version of "write_loose_object()" but with a low memory footprint. We will use this function to unpack large blob objects in a later commit. Another difference from "write_loose_object()" is that we have no chance to run "write_object_file_prepare()" to calculate the oid in advance. In "write_loose_object()", we know the oid and we can write the temporary file in the same directory as the final object, but for an object with an undetermined oid, we don't know the exact directory for the object. Still, we need to save the temporary file we're preparing somewhere. We'll do that in the top-level ".git/objects/" directory (or whatever "GIT_OBJECT_DIRECTORY" is set to). Once we've streamed it we'll know the OID, and will move it to its canonical path. 
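To make the "canonical path" step concrete, here is a minimal sketch of how a loose object's location follows from its hex object ID: the first two hex digits name a subdirectory of the object directory and the remaining digits name the file. sketch_loose_path() is a hypothetical helper written for this illustration; it is not git's loose_object_path() and it ignores alternate object stores and hash-algorithm details.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not git's loose_object_path(): build
 * "<objdir>/<first two hex digits>/<remaining hex digits>" once the
 * object ID is known. Returns 0 on success, -1 if "hex" is too short
 * to split or "out" is too small for the result.
 */
static int sketch_loose_path(char *out, size_t outlen,
			     const char *objdir, const char *hex)
{
	int n;

	assert(out && objdir && hex);

	if (strlen(hex) < 3)
		return -1;
	n = snprintf(out, outlen, "%s/%.2s/%s", objdir, hex, hex + 2);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Until the ID is known only the "<objdir>/" prefix can be computed, which is why the temporary file has to start out in the top-level object directory.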
"freshen_packed_object()" or "freshen_loose_object()" will be called inside "stream_loose_object()" after obtaining the "oid". Helped-by: René Scharfe Helped-by: Ævar Arnfjörð Bjarmason Helped-by: Jiang Xin Signed-off-by: Han Xin Signed-off-by: Ævar Arnfjörð Bjarmason --- object-file.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++ object-store.h | 8 ++++ 2 files changed, 108 insertions(+) diff --git a/object-file.c b/object-file.c index 6e2f2264f8c..2be2bae9afa 100644 --- a/object-file.c +++ b/object-file.c @@ -2118,6 +2118,106 @@ static int freshen_packed_object(const struct object_id *oid) return 1; } +int stream_loose_object(struct input_stream *in_stream, size_t len, + struct object_id *oid) +{ + int fd, ret, err = 0, flush = 0; + unsigned char compressed[4096]; + git_zstream stream; + git_hash_ctx c; + struct strbuf tmp_file = STRBUF_INIT; + struct strbuf filename = STRBUF_INIT; + int dirlen; + char hdr[MAX_HEADER_LEN]; + int hdrlen; + + /* Since oid is not determined, save tmp file to odb path. */ + strbuf_addf(&filename, "%s/", get_object_directory()); + hdrlen = format_object_header(hdr, sizeof(hdr), OBJ_BLOB, len); + + /* + * Common steps for write_loose_object and stream_loose_object to + * start writing loose objects: + * + * - Create tmpfile for the loose object. + * - Setup zlib stream for compression. + * - Start to feed header to zlib stream. + */ + fd = start_loose_object_common(&tmp_file, filename.buf, 0, + &stream, compressed, sizeof(compressed), + &c, hdr, hdrlen); + if (fd < 0) { + err = -1; + goto cleanup; + } + + /* Then the data itself.. */ + do { + unsigned char *in0 = stream.next_in; + + if (!stream.avail_in && !in_stream->is_finished) { + const void *in = in_stream->read(in_stream, &stream.avail_in); + stream.next_in = (void *)in; + in0 = (unsigned char *)in; + /* All data has been read. 
*/ + if (in_stream->is_finished) + flush = 1; + } + ret = write_loose_object_common(&c, &stream, flush, in0, fd, + compressed, sizeof(compressed)); + /* + * Unlike write_loose_object(), we do not have the entire + * buffer. If we get Z_BUF_ERROR due to too few input bytes, + * then we'll replenish them in the next input_stream->read() + * call when we loop. + */ + } while (ret == Z_OK || ret == Z_BUF_ERROR); + + if (stream.total_in != len + hdrlen) + die(_("write stream object %ld != %"PRIuMAX), stream.total_in, + (uintmax_t)len + hdrlen); + + /* Common steps for write_loose_object and stream_loose_object to + * end writing loose oject: + * + * - End the compression of zlib stream. + * - Get the calculated oid. + */ + if (ret != Z_STREAM_END) + die(_("unable to stream deflate new object (%d)"), ret); + ret = end_loose_object_common(&c, &stream, oid); + if (ret != Z_OK) + die(_("deflateEnd on stream object failed (%d)"), ret); + close_loose_object(fd); + + if (freshen_packed_object(oid) || freshen_loose_object(oid)) { + unlink_or_warn(tmp_file.buf); + goto cleanup; + } + + loose_object_path(the_repository, &filename, oid); + + /* We finally know the object path, and create the missing dir. 
*/ + dirlen = directory_size(filename.buf); + if (dirlen) { + struct strbuf dir = STRBUF_INIT; + strbuf_add(&dir, filename.buf, dirlen); + + if (mkdir_in_gitdir(dir.buf) && errno != EEXIST) { + err = error_errno(_("unable to create directory %s"), dir.buf); + strbuf_release(&dir); + goto cleanup; + } + strbuf_release(&dir); + } + + err = finalize_object_file(tmp_file.buf, filename.buf); +cleanup: + strbuf_release(&tmp_file); + strbuf_release(&filename); + return err; +} + int write_object_file_flags(const void *buf, unsigned long len, enum object_type type, struct object_id *oid, unsigned flags) diff --git a/object-store.h b/object-store.h index bd2322ed8ce..1099455bc2e 100644 --- a/object-store.h +++ b/object-store.h @@ -46,6 +46,12 @@ struct object_directory { char *path; }; +struct input_stream { + const void *(*read)(struct input_stream *, unsigned long *len); + void *data; + int is_finished; +}; + KHASH_INIT(odb_path_map, const char * /* key: odb_path */, struct object_directory *, 1, fspathhash, fspatheq) @@ -261,6 +267,8 @@ static inline int write_object_file(const void *buf, unsigned long len, int write_object_file_literally(const void *buf, unsigned long len, const char *type, struct object_id *oid, unsigned flags); +int stream_loose_object(struct input_stream *in_stream, size_t len, + struct object_id *oid); /* * Add an object file to the in-memory object store, without writing it From patchwork Tue Mar 29 13:56:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12794884 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 461FDC433FE for ; Tue, 29 Mar 2022 13:56:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand 
id S237640AbiC2N6Y (ORCPT ); Tue, 29 Mar 2022 09:58:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41814 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237617AbiC2N6J (ORCPT ); Tue, 29 Mar 2022 09:58:09 -0400 Received: from mail-ej1-x62e.google.com (mail-ej1-x62e.google.com [IPv6:2a00:1450:4864:20::62e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 83AF9224743 for ; Tue, 29 Mar 2022 06:56:26 -0700 (PDT) Received: by mail-ej1-x62e.google.com with SMTP id pv16so35381483ejb.0 for ; Tue, 29 Mar 2022 06:56:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=b1ulR9U5v7pVV4iT8TQeISOfZLlmxSyVb0eBhpXz4lU=; b=ofQZK/D6bNDVTqYNrWQHmfTxDbOfU+FV6d7WRMaYWUAjGIJddTcG1iUs+RW6+5dqmn gaIK/w/+KMJW6Kf1ZabVa4hAPPeMON+hlx7KhRB1HUMaJdQpt0qdMG2NtQ6sauipHLFT iEg23g7xdEOKUSWJlwH4WR5g4Mhpi+rlToawWN2bJ4oWjsYyODZcCTcfg/5jJ1JnFBo+ BVZhM0DYyO+NLVZRasIEBKg4QgZZgMeUHFsJLpm7zOzjO0pnrHXGMs9He+SU4501ozDM FIbXg1Kdx6S7os2Jne1JrVpAHfpLdrRbPx7OXONNB9401tv9uKbLX1eWpxK9bAMy7MiJ tKRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=b1ulR9U5v7pVV4iT8TQeISOfZLlmxSyVb0eBhpXz4lU=; b=nbBkGM5DY1tEJ3t2K2Lybq8Zof/weYn9GkDkI96dkYk6qsHTHNTIotWwYqOnwJbI1d EtNuk8a7CGMJ8ZOlK22XyDeOnUTy08y2C9SrykHyalmer0KEIZefZyuG6y4zWGivdUcD XqMElgm2JTieQXnQugyPvEC76J0FQT4J51yvF6TB6rJsk+XXJ6fmGx+3gU8JBJoSuzq9 YuIIY+bXfiK7kfEQ+FceveA/prV/1EYet3nncHqGYcfCObYaFMwsH4a1H4gIKwv7G3wc MkvV6EKmt5xvMa1Q/rfwe/lSW56hyQt9br0bWCMeDQWQuw7ytqwuyV7TUbqjkdWhh2a3 dzAg== X-Gm-Message-State: AOAM533RfngCpqwQ9qOqHtmPApZDqj3S3Gzme7KekY24tjprOJnIicNA pou2oHbQlvpByYW5Py4/vLJF/cuXhhzKvA== X-Google-Smtp-Source: 
ABdhPJyKN/2vcJOqrGQLSTXaVRDuEN4GcdRL0vUsoH7n2QNyaW0GLN73haeED4YnvCsipjpTJGO4MQ==
X-Received: by 2002:a17:907:2d0a:b0:6df:8bc8:236f with SMTP id gs10-20020a1709072d0a00b006df8bc8236fmr34262176ejc.527.1648562184786; Tue, 29 Mar 2022 06:56:24 -0700 (PDT)
Received: from vm.nix.is (vm.nix.is. [2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:24 -0700 (PDT)
From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?=
To: git@vger.kernel.org
Cc: Junio C Hamano, Han Xin, Jiang Xin, =?utf-8?q?Ren=C3=A9_Scharfe?=, Derrick Stolee, Philip Oakley, Neeraj Singh, Elijah Newren, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?=
Subject: [PATCH v12 6/8] core doc: modernize core.bigFileThreshold documentation
Date: Tue, 29 Mar 2022 15:56:11 +0200
Message-Id:
X-Mailer: git-send-email 2.35.1.1548.g36973b18e52
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org

The core.bigFileThreshold documentation has been largely unchanged
since 5eef828bc03 (fast-import: Stream very large blobs directly to
pack, 2010-02-01).

But since then this setting has been expanded to affect a lot more
than that description indicated. Most notably, it changed how "git
diff" treats such files, see 6bf3b813486 (diff --stat: mark any file
larger than core.bigfilethreshold binary, 2014-08-16).

In addition to that, numerous commands and APIs make use of a
streaming mode for files above this threshold.

So let's attempt to summarize 12 years of changes in behavior, which
can be seen with:

    git log --oneline -Gbig_file_thre 5eef828bc03.. -- '*.c'

To do that, turn this into a bullet-point list. The summary Han Xin
produced in [1] helped a lot, but is a bit too detailed for
documentation aimed at users. Let's instead summarize how
user-observable behavior differs, and generally describe how we tend
to stream these files in various commands.

1. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/

Helped-by: Han Xin
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 Documentation/config/core.txt | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 9da3e5d88f6..5fccbd56995 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -412,17 +412,32 @@ You probably do not need to adjust this value.
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
 core.bigFileThreshold::
-	Files larger than this size are stored deflated, without
-	attempting delta compression.  Storing large files without
-	delta compression avoids excessive memory usage, at the
-	slight expense of increased disk usage. Additionally files
-	larger than this size are always treated as binary.
+	The size of files considered "big", which as discussed below
+	changes the behavior of numerous git commands, as well as how
+	such files are stored within the repository. The default is
+	512 MiB. Common unit suffixes of 'k', 'm', or 'g' are
+	supported.
 +
-Default is 512 MiB on all platforms.  This should be reasonable
-for most projects as source code and other text files can still
-be delta compressed, but larger binary media files won't be.
+Files above the configured limit will be:
 +
-Common unit suffixes of 'k', 'm', or 'g' are supported.
+* Stored deflated, without attempting delta compression.
++
+The default limit is primarily set with this use-case in mind. With it
+most projects will have their source code and other text files delta
+compressed, but not larger binary media files.
++
+Storing large files without delta compression avoids excessive memory
+usage, at the slight expense of increased disk usage.
++
+* Will be treated as if they were labeled "binary" (see
+  linkgit:gitattributes[5]). This means that e.g. linkgit:git-log[1]
+  and linkgit:git-diff[1] will not compute diffs for files above this limit.
++
+* Will be generally be streamed when written, which avoids excessive
+memory usage, at the cost of some fixed overhead. Commands that make
+use of this include linkgit:git-archive[1],
+linkgit:git-fast-import[1], linkgit:git-index-pack[1] and
+linkgit:git-fsck[1].
 
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to

From patchwork Tue Mar 29 13:56:12 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?=
X-Patchwork-Id: 12794885
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F338BC433EF for ; Tue, 29 Mar 2022 13:56:47 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237616AbiC2N63 (ORCPT ); Tue, 29 Mar 2022 09:58:29 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41866 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237618AbiC2N6K (ORCPT ); Tue, 29 Mar 2022 09:58:10 -0400
Received: from mail-ed1-x52f.google.com (mail-ed1-x52f.google.com [IPv6:2a00:1450:4864:20::52f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 687732261D7 for ; Tue, 29 Mar 2022 06:56:27 -0700 (PDT)
Received: by mail-ed1-x52f.google.com with SMTP id r23so20836582edb.0 for ; Tue, 29 Mar 2022 06:56:27 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=hruYv5+WyjMiT/86Z+ARfp7aF0DLBAtnNmWREdl/4vA=;
b=EG0L3JyG1C5V6j3A0eu3A+aiBn+F6iP1olXXQ0Mxs49fnM4KzsamtD9V2zuudT0pP4 5u0QBqXXzlEBiVFqZtfCUTfZgN2J88k0JXUwyUtUxeiat/xYIsV8YxKeN0jrIfh3Kd5T P2fVmngCBM5uYFCZCw+rzO4qLmZWNen2cylgcHmkUvMhwS2Or5ARKvu03NPSQ7SvBqkf hTcQDxV8VPOa50ufvt8KmJhOKQoUHc1KfMHTSQGltJk/T56Qg2bPQO3dKtn8ZPa+ZXp+ yY/OX57FcsRYh3KIW8B9GKO/k70Dvz7z/aZUobEyzPmvM40zPlLZYLb0VQY6NAEJMXgQ HDoA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=hruYv5+WyjMiT/86Z+ARfp7aF0DLBAtnNmWREdl/4vA=; b=gSftPVSPL/T2a+hkBaoQUpCri3+PJP0S3Kq5CFtt7iKhlXON4OpF+InzvMKd8UR41k SuBl+ANRGU7/mqsUkSNY26o86wfTxR2/dnXJdsycWfYr3foKU6R0hPLYN98fMITLxqiW thk1mf95xOtuqovEk3bLn0/bAR43OVi5bHlohc5ALj5erwwYhYnAregRpm9bPHLjXEHH f7vZe4ngsH38cqctyvtVrPTWBTKlBfY9xZgTtrUtdxix7ePI+t5GqE57Ad0vrDINqIhY Szajfh+8m7yossT6rbMgzig3DZfkdvXHlEL2O57V9ZL+2i6Ekppot+FZ5dNHlxxZFQ8v 680g== X-Gm-Message-State: AOAM530RbCBZj7gcu3fx5ZF70+n3V/6wwSbSN/WZtdGd2NkM9A7hhdAi Kmtq31mnG3kEutPmZrKI94DRL8JG89qXxw== X-Google-Smtp-Source: ABdhPJyIepZnTIwbHgGNYLY1SiY85AdHdlP8A9/YE7Sl82vV/Z10vyDUIjbZObIARg5lUzZxqV8s3A== X-Received: by 2002:a05:6402:51cf:b0:419:63e2:2b96 with SMTP id r15-20020a05640251cf00b0041963e22b96mr4742125edd.336.1648562185776; Tue, 29 Mar 2022 06:56:25 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. 
[2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:25 -0700 (PDT)
From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?=
To: git@vger.kernel.org
Cc: Junio C Hamano, Han Xin, Jiang Xin, =?utf-8?q?Ren=C3=A9_Scharfe?=, Derrick Stolee, Philip Oakley, Neeraj Singh, Elijah Newren, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?=
Subject: [PATCH v12 7/8] unpack-objects: refactor away unpack_non_delta_entry()
Date: Tue, 29 Mar 2022 15:56:12 +0200
Message-Id:
X-Mailer: git-send-email 2.35.1.1548.g36973b18e52
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org

The unpack_one() function will call either a non-trivial
unpack_delta_entry() or a trivial unpack_non_delta_entry(). Let's
inline the latter in the only caller.

Since 21666f1aae4 (convert object type handling from a string to a
number, 2007-02-26) the unpack_non_delta_entry() function has been
rather trivial, and in a preceding commit the "dry_run" condition it
was handling went away.

This is not done as an optimization, as the compiler will easily
discover that it can do the same. Rather, this makes a subsequent
commit easier to reason about. As it'll be handling "OBJ_BLOB" in a
special manner, let's re-arrange that "case" in preparation for that
change.
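
The point of moving the "OBJ_BLOB" label to the top of the case list can be
shown with a small stand-alone sketch. It is not the code in
builtin/unpack-objects.c: the enum values, the handling type and classify()
are made up for illustration, and the size check only mirrors the special
handling the series adds for large blobs.

```c
#include <stdio.h>

/* Local stand-ins for git's object types; the values here are arbitrary. */
enum object_type { OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG, OBJ_OTHER };

/* Hypothetical result type, used only by this sketch. */
enum handling { HANDLE_BUFFERED, HANDLE_STREAMED, HANDLE_UNKNOWN };

/*
 * With OBJ_BLOB listed first, a blob-only check can run before falling
 * through to the code shared by all non-delta types.
 */
static enum handling classify(enum object_type type, unsigned long size,
			      unsigned long threshold)
{
	switch (type) {
	case OBJ_BLOB:
		if (size > threshold)
			return HANDLE_STREAMED;
		/* fallthrough */
	case OBJ_COMMIT:
	case OBJ_TREE:
	case OBJ_TAG:
		return HANDLE_BUFFERED;
	default:
		return HANDLE_UNKNOWN;
	}
}
```

Had "OBJ_BLOB" stayed between "OBJ_TREE" and "OBJ_TAG", the special case would
need its own copy of the shared body or a goto; placing it first lets a plain
fallthrough do the job.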
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 builtin/unpack-objects.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index e3d30025979..d374599d544 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -338,15 +338,6 @@ static void added_object(unsigned nr, enum object_type type,
 	}
 }
 
-static void unpack_non_delta_entry(enum object_type type, unsigned long size,
-				   unsigned nr)
-{
-	void *buf = get_data(size);
-
-	if (buf)
-		write_object(nr, type, buf, size);
-}
-
 static int resolve_against_held(unsigned nr, const struct object_id *base,
 				void *delta_data, unsigned long delta_size)
 {
@@ -479,12 +470,17 @@ static void unpack_one(unsigned nr)
 	}
 
 	switch (type) {
+	case OBJ_BLOB:
 	case OBJ_COMMIT:
 	case OBJ_TREE:
-	case OBJ_BLOB:
 	case OBJ_TAG:
-		unpack_non_delta_entry(type, size, nr);
+	{
+		void *buf = get_data(size);
+
+		if (buf)
+			write_object(nr, type, buf, size);
 		return;
+	}
 	case OBJ_REF_DELTA:
 	case OBJ_OFS_DELTA:
 		unpack_delta_entry(type, size, nr);

From patchwork Tue Mar 29 13:56:13 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?=
X-Patchwork-Id: 12794886
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9EBABC433EF for ; Tue, 29 Mar 2022 13:56:51 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235227AbiC2N6b (ORCPT ); Tue, 29 Mar 2022 09:58:31 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42662 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237633AbiC2N6V (ORCPT ); Tue, 29 Mar 2022 09:58:21 -0400
Received: from mail-ed1-x533.google.com (mail-ed1-x533.google.com
[IPv6:2a00:1450:4864:20::533]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8AEA32274CB for ; Tue, 29 Mar 2022 06:56:28 -0700 (PDT) Received: by mail-ed1-x533.google.com with SMTP id w25so20787108edi.11 for ; Tue, 29 Mar 2022 06:56:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Vngh4Fxou6rwEF8RDt4p4btz9ytXiOaeMZ8l2sI30dw=; b=Akq81DgH3n8KLZDQsuLhmW+qCBwHcwQ/+iWNyNGLJY+Ubw1C21xzVNgDyCzg7bYJx2 OAIWjzPG5Jn6j14QTpdQPoDHwlTKhV0elf3hTwqqR76M7YIgpVD9AZeRSZKBUOpChT68 0icHhM4HwJ7Q90pGKwCqVZ537Fb6HZAIO3Bv4jz/cwRNgAEs+QAYZsnGYTAZLE4clUAe 2auGdW9afGUmkOfEQLz4ApDTf+TekX/97zSMNEeIYABcdYyuIFUn0Zaj4v2Pv0zyAMMD MB9gH0tsW6FJE2ZMeH3rp+iMHGbSxnfFAe/Gghai+Sua6TldHpwr5799h7YYW9Up/wfJ rDqA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Vngh4Fxou6rwEF8RDt4p4btz9ytXiOaeMZ8l2sI30dw=; b=yviBWTgGMWPuAqutsmc1jGrFw6OJySs7ra1C60UlgYTWRr4akXjAJNm4C81CdQanzX RdpoyJ5KXNMTPUnCi0ezABcgpDJYhCjDBy8tO69TNJDE+rCD+CSa3sAwhsy5yQbCkRQn uxlzCebxxqlO9QMZxwjDwv7Z1U/It74FajrzzP1QGsNJNimBshroq7fm2xANLIHjNBHA rbNVh63/i6rQeXfUT0mpLDFq7vE2pQGP4IgiuG02LtzBvkkFlRmHm11zjJ5JOJ4XLjch ShU9iWGCoY8yUIuqUf7uuZr6jXGqUrYAzXUrwooYBvrZ1csReaNEaLNJ1U2bSqOJlS6B LDuA== X-Gm-Message-State: AOAM531j2oKZcbNh53LVnq3RpCkzqRXqAEnPu/+JiWQemwNTe8Jwd3UC eu5eHwCbbW5UXbieGc7goseY0uxrF7tt7Q== X-Google-Smtp-Source: ABdhPJwlrKbykxF6nQKzeXqEGbah8kzUNrxUn7n6hH+IcxyJcbWdbBG163gji0JDUYG+uwfYvjpYCA== X-Received: by 2002:a50:c3c6:0:b0:416:293f:1f42 with SMTP id i6-20020a50c3c6000000b00416293f1f42mr4680144edf.187.1648562186735; Tue, 29 Mar 2022 06:56:26 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. 
[2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id ds5-20020a170907724500b006df8f39dadesm7006601ejc.218.2022.03.29.06.56.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Mar 2022 06:56:26 -0700 (PDT)
From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?=
To: git@vger.kernel.org
Cc: Junio C Hamano, Han Xin, Jiang Xin, =?utf-8?q?Ren=C3=A9_Scharfe?=, Derrick Stolee, Philip Oakley, Neeraj Singh, Elijah Newren, Han Xin, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFy?= =?utf-8?b?bWFzb24=?=, Jiang Xin
Subject: [PATCH v12 8/8] unpack-objects: use stream_loose_object() to unpack large objects
Date: Tue, 29 Mar 2022 15:56:13 +0200
Message-Id:
X-Mailer: git-send-email 2.35.1.1548.g36973b18e52
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org

From: Han Xin

Make use of the stream_loose_object() function introduced in the
preceding commit to unpack large objects. Before this we'd need to
malloc() the size of the blob before unpacking it, which could cause
OOM with very large blobs.

We could use the new streaming interface to unpack all blobs, but
doing so would be much slower, as demonstrated e.g. with this
benchmark using git-hyperfine[0]:

	rm -rf /tmp/scalar.git &&
	git clone --bare https://github.com/Microsoft/scalar.git /tmp/scalar.git &&
	mv /tmp/scalar.git/objects/pack/*.pack /tmp/scalar.git/my.pack &&
	git hyperfine \
		-r 2 --warmup 1 \
		-L rev origin/master,HEAD -L v "10,512,1k,1m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'./git -C dest.git -c core.bigFileThreshold={v} unpack-objects &1 | grep Maximum'

Using this test we'll always use >100MB of memory on
origin/master (around ~105MB), but max out at e.g. ~55MB if we set
core.bigFileThreshold=50m.
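
As a rough model of why a lower threshold bounds memory use, the sketch below
estimates the largest object-sized buffer needed for a set of blobs. It is an
assumption-laden illustration, not a measurement tool: peak_buffer_bytes() is
a hypothetical helper, it ignores the process baseline and every other
allocation, and the fixed streaming cost is simply the sum of the 8192-byte
inflate buffer and the 4096-byte deflate buffer that appear in this series.

```c
#include <stddef.h>

/* Fixed buffers on the streaming path in this series: 8192 (inflate) + 4096 (deflate). */
#define STREAM_BUF_BYTES (8192UL + 4096UL)

/*
 * Blobs at or below the threshold are buffered whole, so they cost their
 * full size; blobs above it are streamed and cost only the fixed buffers.
 * Returns the largest such cost over all blobs.
 */
static unsigned long peak_buffer_bytes(const unsigned long *blob_sizes, size_t n,
				       unsigned long threshold)
{
	unsigned long peak = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long need = blob_sizes[i] > threshold ?
			STREAM_BUF_BYTES : blob_sizes[i];

		if (need > peak)
			peak = need;
	}
	return peak;
}
```

In this model the peak never exceeds the larger of the threshold and the fixed
streaming buffers, whereas without streaming it is the size of the largest
blob in the pack.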
The relevant "Maximum resident set size" lines were manually added
below the relevant benchmark:

  '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &1 | grep Maximum' in 'origin/master' ran
        Maximum resident set size (kbytes):     107080
    1.02 ± 0.78 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes):     106968
    1.09 ± 0.79 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes):     107032
    1.42 ± 1.07 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes):     107072
    1.83 ± 1.02 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes):     55704
    2.16 ± 1.19 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes):     4564

This shows that if you have enough memory this new streaming method is
slower the lower you set the streaming threshold, but the benefit is
more bounded memory use.

An earlier version of this patch introduced a new
"core.bigFileStreamingThreshold" instead of re-using the existing
"core.bigFileThreshold" variable[1]. As noted in a detailed overview
of its users in [2] using it has several different meanings.

Still, we consider it good enough to simply re-use it. While it's
possible that someone might want to e.g. consider objects "small" for
the purposes of diffing but "big" for the purposes of writing them
such use-cases are probably too obscure to worry about. We can always
split up "core.bigFileThreshold" in the future if there's a need for
that.

0. https://github.com/avar/git-hyperfine/
1. https://lore.kernel.org/git/20211210103435.83656-1-chiyutianyi@gmail.com/
2. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/

Helped-by: Ævar Arnfjörð Bjarmason
Helped-by: Derrick Stolee
Helped-by: Jiang Xin
Signed-off-by: Han Xin
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 Documentation/config/core.txt   |  4 +-
 builtin/unpack-objects.c        | 67 +++++++++++++++++++++++++++++++++
 t/t5351-unpack-large-objects.sh | 26 +++++++++++--
 3 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 5fccbd56995..716259b6762 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -436,8 +436,8 @@ usage, at the slight expense of increased disk usage.
 * Will be generally be streamed when written, which avoids excessive
 memory usage, at the cost of some fixed overhead. Commands that make
 use of this include linkgit:git-archive[1],
-linkgit:git-fast-import[1], linkgit:git-index-pack[1] and
-linkgit:git-fsck[1].
+linkgit:git-fast-import[1], linkgit:git-index-pack[1],
+linkgit:git-unpack-objects[1] and linkgit:git-fsck[1].
 
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index d374599d544..9d7b325c23b 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -338,6 +338,68 @@ static void added_object(unsigned nr, enum object_type type,
 	}
 }
 
+struct input_zstream_data {
+	git_zstream *zstream;
+	unsigned char buf[8192];
+	int status;
+};
+
+static const void *feed_input_zstream(struct input_stream *in_stream,
+				      unsigned long *readlen)
+{
+	struct input_zstream_data *data = in_stream->data;
+	git_zstream *zstream = data->zstream;
+	void *in = fill(1);
+
+	if (in_stream->is_finished) {
+		*readlen = 0;
+		return NULL;
+	}
+
+	zstream->next_out = data->buf;
+	zstream->avail_out = sizeof(data->buf);
+	zstream->next_in = in;
+	zstream->avail_in = len;
+
+	data->status = git_inflate(zstream, 0);
+
+	in_stream->is_finished = data->status != Z_OK;
+	use(len - zstream->avail_in);
+	*readlen = sizeof(data->buf) - zstream->avail_out;
+
+	return data->buf;
+}
+
+static void stream_blob(unsigned long size, unsigned nr)
+{
+	git_zstream zstream = { 0 };
+	struct input_zstream_data data = { 0 };
+	struct input_stream in_stream = {
+		.read = feed_input_zstream,
+		.data = &data,
+	};
+	struct obj_info *info = &obj_list[nr];
+
+	data.zstream = &zstream;
+	git_inflate_init(&zstream);
+
+	if (stream_loose_object(&in_stream, size, &info->oid))
+		die(_("failed to write object in stream"));
+
+	if (data.status != Z_STREAM_END)
+		die(_("inflate returned (%d)"), data.status);
+	git_inflate_end(&zstream);
+
+	if (strict) {
+		struct blob *blob = lookup_blob(the_repository, &info->oid);
+
+		if (!blob)
+			die(_("invalid blob object from stream"));
+		blob->object.flags |= FLAG_WRITTEN;
+	}
+	info->obj = NULL;
+}
+
 static int resolve_against_held(unsigned nr, const struct object_id *base,
 				void *delta_data, unsigned long delta_size)
 {
@@ -471,6 +533,11 @@ static void unpack_one(unsigned nr)
 
 	switch (type) {
 	case OBJ_BLOB:
+		if (!dry_run && size > big_file_threshold) {
+			stream_blob(size, nr);
+			return;
+		}
+		/* fallthrough */
 	case OBJ_COMMIT:
 	case OBJ_TREE:
 	case OBJ_TAG:
diff --git a/t/t5351-unpack-large-objects.sh b/t/t5351-unpack-large-objects.sh
index 8d84313221c..461ca060b2b 100755
--- a/t/t5351-unpack-large-objects.sh
+++ b/t/t5351-unpack-large-objects.sh
@@ -9,7 +9,8 @@ test_description='git unpack-objects with large objects'
 
 prepare_dest () {
 	test_when_finished "rm -rf dest.git" &&
-	git init --bare dest.git
+	git init --bare dest.git &&
+	git -C dest.git config core.bigFileThreshold "$1"
 }
 
 test_expect_success "create large objects (1.5 MB) and PACK" '
@@ -26,16 +27,35 @@ test_expect_success 'set memory limitation to 1MB' '
 '
 
 test_expect_success 'unpack-objects failed under memory limitation' '
-	prepare_dest &&
+	prepare_dest 2m &&
 	test_must_fail git -C dest.git unpack-objects err &&
 	grep "fatal: attempting to allocate" err
 '
 
 test_expect_success 'unpack-objects works with memory limitation in dry-run mode' '
-	prepare_dest &&
+	prepare_dest 2m &&
 	git -C dest.git unpack-objects -n