From patchwork Sat Jan 8 08:54:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12707412 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 08FD0C433F5 for ; Sat, 8 Jan 2022 08:56:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232249AbiAHI4e (ORCPT ); Sat, 8 Jan 2022 03:56:34 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36142 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231394AbiAHI4c (ORCPT ); Sat, 8 Jan 2022 03:56:32 -0500 Received: from mail-pg1-x52d.google.com (mail-pg1-x52d.google.com [IPv6:2607:f8b0:4864:20::52d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 310B5C061574 for ; Sat, 8 Jan 2022 00:56:32 -0800 (PST) Received: by mail-pg1-x52d.google.com with SMTP id a22so3489552pgd.6 for ; Sat, 08 Jan 2022 00:56:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ushZSffbUUFOEG9JLBU0NCvIS5tYyEI+Bs9jQ0a85hY=; b=HuERdGhN9Jx8FaQdnc0Pd1y1Gmq+RC9hLHL7yopohOwrbpsyO6ASZAxMH/8JO6zZC8 fy3YJDG55OXC0YZtbuLrLXpT+vRbRXkeLt4zZ0QZAIHncAun32hvxpwdce82Ied3hpoH 9+th3PcZtsvCy+FLMaeL8B7/Uco1UDoZgTQXwsaCt1R8smxQ1UhOcR1JX3MAUKcGAfIT vhHHYROLkgrA0JMjXP+F4ymw36WqB7RHtawiU9NTXvu4368foGDMHMmWeCuk6kTHRZ+C 92F4eL+zFup8mE5Z+Nj1M/5xBxyl2jTr9CLBenDpnTNCgvSeZHS9818xUDuEpskh+MIK NKew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ushZSffbUUFOEG9JLBU0NCvIS5tYyEI+Bs9jQ0a85hY=; 
b=2gykZEs3GCejeIfpM+sayyputwZdWn9vVhTh8AmmObwRuc/dNzrLAGYJYjkJl4hMXv 3/eewvWfPps9VlNUSs9owT0/gQVyZWoaCcEYP8gv6JznQxjMQGz8zIl1vSUGfjYgLYMr kYKKBAq6HWPwI9dOfYetZeAk8rMdkhvLgkPJD2tJ89sIPQI9gaPEHX6qHP8hit22x4qh 5LUKXsSUw1Udqfg/01LBVI+vM+sFtfMOqxjz2FnAE8eeW6ejkbrQHvcUBwWbiDTMeUXL ZaDGc8+grB2lZ/fTJRb5V9nJ+5SWvnZfv3OlmVWtj751qCD+/c0q7bWw1cjQ49ByjEIH N/BA== X-Gm-Message-State: AOAM5319fsn8RwtlNr15LXb4BNCIxXsmJEaCeb2aw/PwtVCJbPPKPzg+ MDYDCPdXDdVWSOcxgn82MBw= X-Google-Smtp-Source: ABdhPJzDExivl7rpWMOyX9aolOXAQh5j4kAhedPFfOZhkIwIHmZqDClwPu+ZundpLhq5+nas48pSlg== X-Received: by 2002:a63:9544:: with SMTP id t4mr1913508pgn.175.1641632191655; Sat, 08 Jan 2022 00:56:31 -0800 (PST) Received: from localhost.localdomain ([58.100.34.57]) by smtp.gmail.com with ESMTPSA id x25sm1240990pfu.113.2022.01.08.00.56.29 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sat, 08 Jan 2022 00:56:31 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v8 1/6] unpack-objects: low memory footprint for get_data() in dry_run mode Date: Sat, 8 Jan 2022 16:54:14 +0800 Message-Id: <20220108085419.79682-2-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.gc288e771b4.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org
From: Han Xin

As the name implies, "get_data(size)" will allocate and return a given size of memory. Allocating memory for a large blob object may cause the system to run out of memory. In preparation for replacing the "get_data()" calls that unpack large blob objects in later commits, refactor "get_data()" to reduce its memory footprint in dry_run mode.
Because in dry_run mode, "get_data()" is only used to check the integrity of data, and the returned buffer is not used at all, we can allocate a smaller buffer and reuse it as zstream output. Therefore, in dry_run mode, "get_data()" will release the allocated buffer and return NULL instead of returning garbage data. Suggested-by: Jiang Xin Signed-off-by: Han Xin --- builtin/unpack-objects.c | 39 ++++++++++++++++++------- t/t5329-unpack-large-objects.sh | 52 +++++++++++++++++++++++++++++++++ 2 files changed, 80 insertions(+), 11 deletions(-) create mode 100755 t/t5329-unpack-large-objects.sh diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 4a9466295b..c6d6c17072 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -96,15 +96,31 @@ static void use(int bytes) display_throughput(progress, consumed_bytes); } +/* + * Decompress zstream from stdin and return specific size of data. + * The caller is responsible to free the returned buffer. + * + * But for dry_run mode, "get_data()" is only used to check the + * integrity of data, and the returned buffer is not used at all. + * Therefore, in dry_run mode, "get_data()" will release the small + * allocated buffer which is reused to hold temporary zstream output + * and return NULL instead of returning garbage data. 
+ */ static void *get_data(unsigned long size) { git_zstream stream; - void *buf = xmallocz(size); + unsigned long bufsize; + void *buf; memset(&stream, 0, sizeof(stream)); + if (dry_run && size > 8192) + bufsize = 8192; + else + bufsize = size; + buf = xmallocz(bufsize); stream.next_out = buf; - stream.avail_out = size; + stream.avail_out = bufsize; stream.next_in = fill(1); stream.avail_in = len; git_inflate_init(&stream); @@ -124,8 +140,15 @@ static void *get_data(unsigned long size) } stream.next_in = fill(1); stream.avail_in = len; + if (dry_run) { + /* reuse the buffer in dry_run mode */ + stream.next_out = buf; + stream.avail_out = bufsize; + } } git_inflate_end(&stream); + if (dry_run) + FREE_AND_NULL(buf); return buf; } @@ -325,10 +348,8 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size, { void *buf = get_data(size); - if (!dry_run && buf) + if (buf) write_object(nr, type, buf, size); - else - free(buf); } static int resolve_against_held(unsigned nr, const struct object_id *base, @@ -358,10 +379,8 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, oidread(&base_oid, fill(the_hash_algo->rawsz)); use(the_hash_algo->rawsz); delta_data = get_data(delta_size); - if (dry_run || !delta_data) { - free(delta_data); + if (!delta_data) return; - } if (has_object_file(&base_oid)) ; /* Ok we have this one */ else if (resolve_against_held(nr, &base_oid, @@ -397,10 +416,8 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, die("offset value out of bound for delta base object"); delta_data = get_data(delta_size); - if (dry_run || !delta_data) { - free(delta_data); + if (!delta_data) return; - } lo = 0; hi = nr; while (lo < hi) { diff --git a/t/t5329-unpack-large-objects.sh b/t/t5329-unpack-large-objects.sh new file mode 100755 index 0000000000..39c7a62d94 --- /dev/null +++ b/t/t5329-unpack-large-objects.sh @@ -0,0 +1,52 @@ +#!/bin/sh +# +# Copyright (c) 2021 Han Xin +# + 
+test_description='git unpack-objects with large objects' + +. ./test-lib.sh + +prepare_dest () { + test_when_finished "rm -rf dest.git" && + git init --bare dest.git +} + +assert_no_loose () { + glob=dest.git/objects/?? && + echo "$glob" >expect && + eval "echo $glob" >actual && + test_cmp expect actual +} + +assert_no_pack () { + rmdir dest.git/objects/pack +} + +test_expect_success "create large objects (1.5 MB) and PACK" ' + test-tool genrandom foo 1500000 >big-blob && + test_commit --append foo big-blob && + test-tool genrandom bar 1500000 >big-blob && + test_commit --append bar big-blob && + PACK=$(echo HEAD | git pack-objects --revs test) +' + +test_expect_success 'set memory limitation to 1MB' ' + GIT_ALLOC_LIMIT=1m && + export GIT_ALLOC_LIMIT +' + +test_expect_success 'unpack-objects failed under memory limitation' ' + prepare_dest && + test_must_fail git -C dest.git unpack-objects err && + grep "fatal: attempting to allocate" err +' + +test_expect_success 'unpack-objects works with memory limitation in dry-run mode' ' + prepare_dest && + git -C dest.git unpack-objects -n X-Patchwork-Id: 12707413 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 056D3C433EF for ; Sat, 8 Jan 2022 08:56:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233842AbiAHI4g (ORCPT ); Sat, 8 Jan 2022 03:56:36 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36156 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231402AbiAHI4e (ORCPT ); Sat, 8 Jan 2022 03:56:34 -0500 Received: from mail-pg1-x531.google.com (mail-pg1-x531.google.com [IPv6:2607:f8b0:4864:20::531]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BCDB0C061574 for ; Sat, 8 Jan 2022 00:56:34 -0800 (PST) Received: by mail-pg1-x531.google.com with 
SMTP id v25so7878211pge.2 for ; Sat, 08 Jan 2022 00:56:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Ta75UCWo4uaFf+Myyrd3yxCCpaAj5zRZZziSce7FLFg=; b=B19Wm1MkLi4ScQzGOhQ/eqtGa6aFBiMEC6Nv9lmRHPTfxz112d+TEZO1gpCknnA7LQ Tpfl2l+BczbvRwRbJKoKXY08Q3hCF6+zKVbmVnaFhdd4uPQkXPpx64xfl2dbkr9gKRoy LB7vICqiOlDPJhHchtijM0POz2mhC/RYCL4ZHwbbfrLiPBUrJ91Oh/iJn/ljWdz6uniC /hX5l8Eyyg0O0X8CLR+NvQ8GFT9AJERKjR/cIR1bFCQDAhvsUM0woQVP/quGHnhIsXqy VwXXRA97SkHMJs8YZO2JXVB4UM6tt/xtAx7UsVDiZ0w0kWcWngcUnZQp5VYvyUkSCSuK 0Xqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Ta75UCWo4uaFf+Myyrd3yxCCpaAj5zRZZziSce7FLFg=; b=FtwbUtYU6j3SYiOgu8ACS0WLADxWHxS0DIZE9X6etsQtu+r1n9aHnVZeQgZoHcIjvN p2vdRo9ZydQKIr9zXQsJu1P5RJll4lGciZ0fCmHwIOU7kNiT2KXYIbfSx/PBjr7COjpm CvAZEGrTCZhM3aDXe/tZMamWeJDJyRLIxCcF8Kmu3Kij87IUoGe7LjK3pNZGKgtZv9gr O1t6gcA+d5qxmcox/N7hVni5PD1FURuF7WsUwtV3kOqgHILGouj5iVmbhu7QLrtf0bYF kSwsvFGaGG2EVFRhmuq9jb2J4+VKuQeK9bY4s64MIdIqeedia1iGM6nqZ+aMiaQC0MI1 ZjzA== X-Gm-Message-State: AOAM532B5VzrCpDqVd4ltsGb11c6cAQmN0JKTNI+h0PPJRm5IE0AX+Nx xb9lhVDylPr4o2G9oh8WkbY= X-Google-Smtp-Source: ABdhPJz65Uyn0AtQqFhCUxmqYpoxLqNZxbqUmRfyPHvCxGmW65ezewcADdXSm9W5gXes8GtcrqI0Qw== X-Received: by 2002:a63:7148:: with SMTP id b8mr37114750pgn.616.1641632194259; Sat, 08 Jan 2022 00:56:34 -0800 (PST) Received: from localhost.localdomain ([58.100.34.57]) by smtp.gmail.com with ESMTPSA id x25sm1240990pfu.113.2022.01.08.00.56.31 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sat, 08 Jan 2022 00:56:33 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , 
=?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v8 2/6] object-file.c: refactor write_loose_object() to several steps Date: Sat, 8 Jan 2022 16:54:15 +0800 Message-Id: <20220108085419.79682-3-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.gc288e771b4.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org
From: Han Xin

When writing a large blob using "write_loose_object()", we have to pass a buffer with the whole content of the blob, and this behavior will consume lots of memory and may cause OOM. We will introduce a streaming version of the function ("stream_loose_object()") in a later commit to resolve this issue.

Before introducing a streaming version of the function for writing loose objects, do some refactoring on "write_loose_object()" to reuse code for both versions.

Rewrite "write_loose_object()" as follows:

1. Figure out a path for the (temp) object file. This step is only used in "write_loose_object()".
2. Move common steps for starting to write loose objects into a new function "start_loose_object_common()".
3. Compress data.
4. Move common steps for ending zlib stream into a new function "end_loose_object_common()".
5. Close fd and finalize the object file.

Helped-by: Ævar Arnfjörð Bjarmason
Helped-by: Jiang Xin
Signed-off-by: Han Xin
---
object-file.c | 149 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 105 insertions(+), 44 deletions(-)
diff --git a/object-file.c b/object-file.c index eb1426f98c..5d163081b1 100644 --- a/object-file.c +++ b/object-file.c @@ -1743,6 +1743,25 @@ static void write_object_file_prepare(const struct git_hash_algo *algo, algo->final_oid_fn(oid, &c); } +/* + * Move the just written object with proper mtime into its final resting place. 
+ */ +static int finalize_object_file_with_mtime(const char *tmpfile, + const char *filename, + time_t mtime, + unsigned flags) +{ + struct utimbuf utb; + + if (mtime) { + utb.actime = mtime; + utb.modtime = mtime; + if (utime(tmpfile, &utb) < 0 && !(flags & HASH_SILENT)) + warning_errno(_("failed utime() on %s"), tmpfile); + } + return finalize_object_file(tmpfile, filename); +} + /* * Move the just written object into its final resting place. */ @@ -1828,7 +1847,8 @@ static inline int directory_size(const char *filename) * We want to avoid cross-directory filename renames, because those * can have problems on various filesystems (FAT, NFS, Coda). */ -static int create_tmpfile(struct strbuf *tmp, const char *filename) +static int create_tmpfile(struct strbuf *tmp, const char *filename, + unsigned flags) { int fd, dirlen = directory_size(filename); @@ -1836,7 +1856,9 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename) strbuf_add(tmp, filename, dirlen); strbuf_addstr(tmp, "tmp_obj_XXXXXX"); fd = git_mkstemp_mode(tmp->buf, 0444); - if (fd < 0 && dirlen && errno == ENOENT) { + do { + if (fd >= 0 || !dirlen || errno != ENOENT) + break; /* * Make sure the directory exists; note that the contents * of the buffer are undefined after mkstemp returns an @@ -1846,17 +1868,72 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename) strbuf_reset(tmp); strbuf_add(tmp, filename, dirlen - 1); if (mkdir(tmp->buf, 0777) && errno != EEXIST) - return -1; + break; if (adjust_shared_perm(tmp->buf)) - return -1; + break; /* Try again */ strbuf_addstr(tmp, "/tmp_obj_XXXXXX"); fd = git_mkstemp_mode(tmp->buf, 0444); + } while (0); + + if (fd < 0 && !(flags & HASH_SILENT)) { + if (errno == EACCES) + return error(_("insufficient permission for adding an " + "object to repository database %s"), + get_object_directory()); + else + return error_errno(_("unable to create temporary file")); } + return fd; } +static int start_loose_object_common(struct strbuf 
*tmp_file, + const char *filename, unsigned flags, + git_zstream *stream, + unsigned char *buf, size_t buflen, + git_hash_ctx *c, + enum object_type type, size_t len, + char *hdr, int hdrlen) +{ + int fd; + + fd = create_tmpfile(tmp_file, filename, flags); + if (fd < 0) + return -1; + + /* Setup zlib stream for compression */ + git_deflate_init(stream, zlib_compression_level); + stream->next_out = buf; + stream->avail_out = buflen; + the_hash_algo->init_fn(c); + + /* Start to feed header to zlib stream */ + stream->next_in = (unsigned char *)hdr; + stream->avail_in = hdrlen; + while (git_deflate(stream, 0) == Z_OK) + ; /* nothing */ + the_hash_algo->update_fn(c, hdr, hdrlen); + + return fd; +} + +static void end_loose_object_common(int ret, git_hash_ctx *c, + git_zstream *stream, + struct object_id *parano_oid, + const struct object_id *expected_oid, + const char *die_msg1_fmt, + const char *die_msg2_fmt) +{ + if (ret != Z_STREAM_END) + die(_(die_msg1_fmt), ret, expected_oid); + ret = git_deflate_end_gently(stream); + if (ret != Z_OK) + die(_(die_msg2_fmt), ret, expected_oid); + the_hash_algo->final_oid_fn(parano_oid, c); +} + static int write_loose_object(const struct object_id *oid, char *hdr, int hdrlen, const void *buf, unsigned long len, time_t mtime, unsigned flags) @@ -1871,28 +1948,18 @@ static int write_loose_object(const struct object_id *oid, char *hdr, loose_object_path(the_repository, &filename, oid); - fd = create_tmpfile(&tmp_file, filename.buf); - if (fd < 0) { - if (flags & HASH_SILENT) - return -1; - else if (errno == EACCES) - return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory()); - else - return error_errno(_("unable to create temporary file")); - } - - /* Set it up */ - git_deflate_init(&stream, zlib_compression_level); - stream.next_out = compressed; - stream.avail_out = sizeof(compressed); - the_hash_algo->init_fn(&c); - - /* First header.. 
*/ - stream.next_in = (unsigned char *)hdr; - stream.avail_in = hdrlen; - while (git_deflate(&stream, 0) == Z_OK) - ; /* nothing */ - the_hash_algo->update_fn(&c, hdr, hdrlen); + /* Common steps for write_loose_object and stream_loose_object to + * start writing loose object: + * + * - Create tmpfile for the loose object. + * - Setup zlib stream for compression. + * - Start to feed header to zlib stream. + */ + fd = start_loose_object_common(&tmp_file, filename.buf, flags, + &stream, compressed, sizeof(compressed), + &c, OBJ_NONE, 0, hdr, hdrlen); + if (fd < 0) + return -1; /* Then the data itself.. */ stream.next_in = (void *)buf; @@ -1907,30 +1974,24 @@ static int write_loose_object(const struct object_id *oid, char *hdr, stream.avail_out = sizeof(compressed); } while (ret == Z_OK); - if (ret != Z_STREAM_END) - die(_("unable to deflate new object %s (%d)"), oid_to_hex(oid), - ret); - ret = git_deflate_end_gently(&stream); - if (ret != Z_OK) - die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid), - ret); - the_hash_algo->final_oid_fn(&parano_oid, &c); + /* Common steps for write_loose_object and stream_loose_object to + * end writing loose object: + * + * - End the compression of zlib stream. + * - Get the calculated oid to "parano_oid". 
+ */ + end_loose_object_common(ret, &c, &stream, &parano_oid, oid, + N_("unable to deflate new object %s (%d)"), + N_("deflateEnd on object %s failed (%d)")); + if (!oideq(oid, &parano_oid)) die(_("confused by unstable object source data for %s"), oid_to_hex(oid)); close_loose_object(fd); - if (mtime) { - struct utimbuf utb; - utb.actime = mtime; - utb.modtime = mtime; - if (utime(tmp_file.buf, &utb) < 0 && - !(flags & HASH_SILENT)) - warning_errno(_("failed utime() on %s"), tmp_file.buf); - } - - return finalize_object_file(tmp_file.buf, filename.buf); + return finalize_object_file_with_mtime(tmp_file.buf, filename.buf, + mtime, flags); } static int freshen_loose_object(const struct object_id *oid)
From patchwork Sat Jan 8 08:54:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12707414 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2E962C433EF for ; Sat, 8 Jan 2022 08:56:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233815AbiAHI4i (ORCPT ); Sat, 8 Jan 2022 03:56:38 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36166 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231402AbiAHI4h (ORCPT ); Sat, 8 Jan 2022 03:56:37 -0500 Received: from mail-pj1-x102f.google.com (mail-pj1-x102f.google.com [IPv6:2607:f8b0:4864:20::102f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3A29FC061574 for ; Sat, 8 Jan 2022 00:56:37 -0800 (PST) Received: by mail-pj1-x102f.google.com with SMTP id r16-20020a17090a0ad000b001b276aa3aabso14816588pje.0 for ; Sat, 08 Jan 2022 00:56:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; 
h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=+sBRqX3wRqfZPScS0uBjLr2x1Q+9m/7H0QMgLJdRyXk=; b=KcsWEFNdQ6RnTzGcyHkyBWjurmCnygbE+XR83bfYODXoJBOBq+Sk7qfyVjHTkYrw8u PIoFQeqPAwC2OEOK5gTX0HJVnFuVJqc6yMqkTpm1Ui8xGgl1NlY61177H3uDrM7P9+1e Em19oncJG+3+dASKxupk2l+53fxTqitB4SkMDRdPAQaRV9yOyCBh9vsyvZch0O0692Nz iC5C3vn9JuxQhdWgLWPA/MMcYP3p0f7AqMYXVuUB83tove2ZkhlegDrtb8biYGc/JWwA TZh2I0cnCYbygOg6TtEO6CksYOFK/TpBy8Mc4UtEQu2nNRz4ZYrxAhBOXrdjbTPvVUI+ LhUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=+sBRqX3wRqfZPScS0uBjLr2x1Q+9m/7H0QMgLJdRyXk=; b=yA/5iQA/oZKK57fSw1nClPymiJMjbZV9lOZE1Fumacdfjk4OmGPS5usSsKz8zsX7KY ldF4Yrtreg6cOs9oAgToh7cuvFWKIQZrWl/cZoo06XikTc+AYd87pGpd/lXC5ulg+Efb qnAt51fm3Pxzn1JAizO2vPky8HUfC/TTd3L28Qn5f8U7qnfRUNxFtTGBq1zXaHcxAk7R iprjBLf+EkZvhJEysRi3sFCOK2ifIuhGo1tWlj+zBvMfhBmhQC/q5ZT0qXxZjmdgOn0I wl/MHBPFs7AWGoxWviGTNMmcbB4hXOCbiHLk04J7NGbi9euUudSZ6X00SjTvkr0HzcTH lh1A== X-Gm-Message-State: AOAM531hdceKC1SSfBMOxeq0DYwHuAVvb8MhXA6WGcEa3kObDS1bGM77 AR1Ei5JytOs9/X9rprlMhPk= X-Google-Smtp-Source: ABdhPJyjjxjsNo6ot+RclcUGHrs1al13DtpB3jYhRT+EVDJ2spczD/Y//Rq69NUrzkhPLRy1tx4STg== X-Received: by 2002:a17:902:b58d:b0:149:9c02:6345 with SMTP id a13-20020a170902b58d00b001499c026345mr43748629pls.21.1641632196838; Sat, 08 Jan 2022 00:56:36 -0800 (PST) Received: from localhost.localdomain ([58.100.34.57]) by smtp.gmail.com with ESMTPSA id x25sm1240990pfu.113.2022.01.08.00.56.34 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sat, 08 Jan 2022 00:56:36 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v8 3/6] object-file.c: remove the slash for 
directory_size() Date: Sat, 8 Jan 2022 16:54:16 +0800 Message-Id: <20220108085419.79682-4-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.gc288e771b4.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org
From: Han Xin

Since "mkdir foo/" works as well as "mkdir foo", let's make directory_size() exclude the trailing slash, which is what many of its users want.

Suggested-by: Ævar Arnfjörð Bjarmason
Signed-off-by: Han Xin
---
object-file.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/object-file.c b/object-file.c index 5d163081b1..4f0127e823 100644 --- a/object-file.c +++ b/object-file.c @@ -1831,13 +1831,13 @@ static void close_loose_object(int fd) die_errno(_("error when closing loose object file")); } -/* Size of directory component, including the ending '/' */ +/* Size of directory component, excluding the ending '/' */ static inline int directory_size(const char *filename) { const char *s = strrchr(filename, '/'); if (!s) return 0; - return s - filename + 1; + return s - filename; } /* @@ -1854,7 +1854,7 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename, strbuf_reset(tmp); strbuf_add(tmp, filename, dirlen); - strbuf_addstr(tmp, "tmp_obj_XXXXXX"); + strbuf_addstr(tmp, "/tmp_obj_XXXXXX"); fd = git_mkstemp_mode(tmp->buf, 0444); do { if (fd >= 0 || !dirlen || errno != ENOENT) @@ -1866,7 +1866,7 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename, * scratch. 
*/ strbuf_reset(tmp); - strbuf_add(tmp, filename, dirlen - 1); + strbuf_add(tmp, filename, dirlen); if (mkdir(tmp->buf, 0777) && errno != EEXIST) break; if (adjust_shared_perm(tmp->buf)) From patchwork Sat Jan 8 08:54:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12707415 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 95748C433F5 for ; Sat, 8 Jan 2022 08:56:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233971AbiAHI4r (ORCPT ); Sat, 8 Jan 2022 03:56:47 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36214 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233961AbiAHI4o (ORCPT ); Sat, 8 Jan 2022 03:56:44 -0500 Received: from mail-pl1-x62c.google.com (mail-pl1-x62c.google.com [IPv6:2607:f8b0:4864:20::62c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0FD6AC061747 for ; Sat, 8 Jan 2022 00:56:40 -0800 (PST) Received: by mail-pl1-x62c.google.com with SMTP id e19so145219plc.10 for ; Sat, 08 Jan 2022 00:56:40 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Bn7aWpicFpGEy8B9B/JjBfd85aVgRILWqLk+gDEGNmM=; b=Z64+BXdb/UkmfJkJE0evi2bxTPYvG4cTaa7sOK7CAW/rSCvMvZoHEg3/7zNyiY4kGB F1d+8VgCkwHL9Zh211ZPTQZbDLWPTrZzwvx2SsfKaqg5VAQdvpdt4UpnCIvY/vT56pnk yrPc+746h02DyFpH2oIWYEADcLKIG73PyvBB0tHMyHiH5Whg+2aLhtIdtoNDpsnvzKTN ZvXjyHdPIa3wTQHWeoX6N8vpWRUg4xFWGrFfdPhxKqZ962eHxOWuzANEj5M9GgPCFt6Y E32hAixeOZ/gVuNLdatnlRpqX9y2miPWo6WwRaOXBQyw19EFg9RJnwncTS/KVTv6jjie 14RA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; 
h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Bn7aWpicFpGEy8B9B/JjBfd85aVgRILWqLk+gDEGNmM=; b=DrZ+Uj9HTrqf5Xe/Ql61NSe7aGcv0FFwAju0RcmZ03sekUhGtY59w9kwRAMVvcRyPc 5X8qwaOSdUSWDhdbaCLNZ3uWSklF4DdjV2S4/Z4cuf9beN58fUa62M9PzpeZ6//rfLEz Cvj/ecBlPE1ZwiQWdhxhRNhXYxY7aMTlxXrTHU2kf+qLB/CNEncCpEN9ggBLhqZ1ld7M BqsTGYlqQdyYaj401RJ/Pd3Y2JcDfl5QXJQ85htcBiO57/5dsdzdoOwHxQT1le1J4bhw 9C8j3Ekl26z5PAfSWYTzkDUZ1h7r1wiOLfofs/Oc/8i+EYDhDdKbiVah3pED5FXLwG0b uXvA== X-Gm-Message-State: AOAM530/LEVfLUncxIpSjOmZW/8xJPHfPmWPUbQ48ur0IoDf1ejTfRBl gQdsMmyEKKIe+cb9+V0VGHc= X-Google-Smtp-Source: ABdhPJyQ4rv+fO6y5Y6fmAOmLJgAghC9gvkQQ1zRfKU5lBrN3aA6Kiy5cpAOu02UZ7w5mpphR7UlaQ== X-Received: by 2002:a17:902:da89:b0:149:304b:fd70 with SMTP id j9-20020a170902da8900b00149304bfd70mr65477938plx.53.1641632199474; Sat, 08 Jan 2022 00:56:39 -0800 (PST) Received: from localhost.localdomain ([58.100.34.57]) by smtp.gmail.com with ESMTPSA id x25sm1240990pfu.113.2022.01.08.00.56.37 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sat, 08 Jan 2022 00:56:39 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v8 4/6] object-file.c: add "stream_loose_object()" to handle large object Date: Sat, 8 Jan 2022 16:54:17 +0800 Message-Id: <20220108085419.79682-5-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.gc288e771b4.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org
From: Han Xin

If we want to unpack and write a loose object using "write_loose_object()", we have to feed it a buffer the same size as the object, which will consume lots of memory and may cause OOM.
This can be improved by feeding data to "stream_loose_object()" in a stream.

Add a new function "stream_loose_object()", which is a streaming version of "write_loose_object()" but with a low memory footprint. We will use this function to unpack large blob objects in a later commit.

Another difference from "write_loose_object()" is that we have no chance to run "write_object_file_prepare()" to calculate the oid in advance. In "write_loose_object()", we know the oid and we can write the temporary file in the same directory as the final object, but for an object with an undetermined oid, we don't know the exact directory for the object, so we have to save the temporary file in the ".git/objects/" directory instead.

"freshen_packed_object()" or "freshen_loose_object()" will be called inside "stream_loose_object()" after obtaining the "oid".

Helped-by: René Scharfe
Helped-by: Ævar Arnfjörð Bjarmason
Helped-by: Jiang Xin
Signed-off-by: Han Xin
---
object-file.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++
object-store.h | 9 +++++
2 files changed, 110 insertions(+)
diff --git a/object-file.c b/object-file.c index 4f0127e823..a462a21629 100644 --- a/object-file.c +++ b/object-file.c @@ -2012,6 +2012,107 @@ static int freshen_packed_object(const struct object_id *oid) return 1; } +int stream_loose_object(struct input_stream *in_stream, size_t len, + struct object_id *oid) +{ + int fd, ret, err = 0, flush = 0; + unsigned char compressed[4096]; + git_zstream stream; + git_hash_ctx c; + struct strbuf tmp_file = STRBUF_INIT; + struct strbuf filename = STRBUF_INIT; + int dirlen; + char hdr[MAX_HEADER_LEN]; + int hdrlen; + + /* Since oid is not determined, save tmp file to odb path. */ + strbuf_addf(&filename, "%s/", get_object_directory()); + hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, type_name(OBJ_BLOB), len) + 1; + + /* Common steps for write_loose_object and stream_loose_object to + * start writing loose object: + * + * - Create tmpfile for the loose object. 
+ * - Setup zlib stream for compression. + * - Start to feed header to zlib stream. + */ + fd = start_loose_object_common(&tmp_file, filename.buf, 0, + &stream, compressed, sizeof(compressed), + &c, OBJ_BLOB, len, hdr, hdrlen); + if (fd < 0) { + err = -1; + goto cleanup; + } + + /* Then the data itself.. */ + do { + unsigned char *in0 = stream.next_in; + if (!stream.avail_in && !in_stream->is_finished) { + const void *in = in_stream->read(in_stream, &stream.avail_in); + stream.next_in = (void *)in; + in0 = (unsigned char *)in; + /* All data has been read. */ + if (in_stream->is_finished) + flush = Z_FINISH; + } + ret = git_deflate(&stream, flush); + the_hash_algo->update_fn(&c, in0, stream.next_in - in0); + if (write_buffer(fd, compressed, stream.next_out - compressed) < 0) + die(_("unable to write loose object file")); + stream.next_out = compressed; + stream.avail_out = sizeof(compressed); + /* + * Unlike write_loose_object(), we do not have the entire + * buffer. If we get Z_BUF_ERROR due to too few input bytes, + * then we'll replenish them in the next input_stream->read() + * call when we loop. + */ + } while (ret == Z_OK || ret == Z_BUF_ERROR); + + if (stream.total_in != len + hdrlen) + die(_("write stream object %ld != %"PRIuMAX), stream.total_in, + (uintmax_t)len + hdrlen); + + /* Common steps for write_loose_object and stream_loose_object to + * end writing loose object: + * + * - End the compression of zlib stream. + * - Get the calculated oid. + */ + end_loose_object_common(ret, &c, &stream, oid, NULL, + N_("unable to stream deflate new object (%d)"), + N_("deflateEnd on stream object failed (%d)")); + + close_loose_object(fd); + + if (freshen_packed_object(oid) || freshen_loose_object(oid)) { + unlink_or_warn(tmp_file.buf); + goto cleanup; + } + + loose_object_path(the_repository, &filename, oid); + + /* We finally know the object path, and create the missing dir. 
*/ + dirlen = directory_size(filename.buf); + if (dirlen) { + struct strbuf dir = STRBUF_INIT; + strbuf_add(&dir, filename.buf, dirlen); + + if (mkdir_in_gitdir(dir.buf) && errno != EEXIST) { + err = error_errno(_("unable to create directory %s"), dir.buf); + strbuf_release(&dir); + goto cleanup; + } + strbuf_release(&dir); + } + + err = finalize_object_file(tmp_file.buf, filename.buf); +cleanup: + strbuf_release(&tmp_file); + strbuf_release(&filename); + return err; +} + int write_object_file_flags(const void *buf, unsigned long len, const char *type, struct object_id *oid, unsigned flags) diff --git a/object-store.h b/object-store.h index 952efb6a4b..cc41c64d69 100644 --- a/object-store.h +++ b/object-store.h @@ -34,6 +34,12 @@ struct object_directory { char *path; }; +struct input_stream { + const void *(*read)(struct input_stream *, unsigned long *len); + void *data; + int is_finished; +}; + KHASH_INIT(odb_path_map, const char * /* key: odb_path */, struct object_directory *, 1, fspathhash, fspatheq) @@ -232,6 +238,9 @@ static inline int write_object_file(const void *buf, unsigned long len, return write_object_file_flags(buf, len, type, oid, 0); } +int stream_loose_object(struct input_stream *in_stream, size_t len, + struct object_id *oid); + int hash_object_file_literally(const void *buf, unsigned long len, const char *type, struct object_id *oid, unsigned flags); From patchwork Sat Jan 8 08:54:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12707417 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB28DC433F5 for ; Sat, 8 Jan 2022 08:56:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233986AbiAHI4r (ORCPT ); Sat, 8 Jan 2022 03:56:47 -0500 
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36216 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233908AbiAHI4p (ORCPT ); Sat, 8 Jan 2022 03:56:45 -0500 Received: from mail-pg1-x532.google.com (mail-pg1-x532.google.com [IPv6:2607:f8b0:4864:20::532]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A6230C06175E for ; Sat, 8 Jan 2022 00:56:42 -0800 (PST) Received: by mail-pg1-x532.google.com with SMTP id f8so7867501pgf.8 for ; Sat, 08 Jan 2022 00:56:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=iF6btyyvqMA4vkeWPMmMS0V6kNYg+v1sU5PYKy3hv7E=; b=pjVC1Dg8FPKm92IAepTB8xhJB6J0mZkx+MBThJsst6+gtKZLQuGBwH8NxSH5J4nCGJ fgKZmmTAPdTcLljXkGbaoL8u5HfHkeOmO3yk7WpdNdIis4k50AJUHehvX4yKMVTLJJRF Fu6Nd/H/p1BxDhn/fBELvq6U+tIK9upFHzZHYPA62ZrN3tCeqCViY5iFUJye9ZOTR42Z lhnRaOCSxTPUPW9+5K3EDQHsTC9kXKNBpaUo9wM6xkj58JjaQ+zQ/7D83745+S4nSAGy 01rv8o1pjLVjPh338uA7R+wuK6VtR4K60berSwWM9QZ0YHcbAuJ0bGKYm95BXDzYpd5Z KKGw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=iF6btyyvqMA4vkeWPMmMS0V6kNYg+v1sU5PYKy3hv7E=; b=lGquWs+aG22SnSt6X+uPyDZftthyelrVnZkXpgE6BQlCglI8PPvIP5RHKTToTApHZ8 0/SFULbgWbinNAjrtUYtYN8aLXP1wnbch51+cRiArHk8wc/rJK5ED/B3VABhYy9GmuOg 5aUGbe1lmp+CAHq8+HZCuhZ0EzcYbbfu2tDSXRkQG4sEiATJdk8HOT9j2Frq3X4Uf1T8 3ZC/7Wfs7jriBlS9Iz5IjUt22T0BVET/Xid7XGiFa7uJSL40uaTfS+Tqbm6whuyskll0 R0up7QGC6U1fiwhLP8tlAztR5+zBXmjSFtiMK1kX+nTu5BlAwSIrSArxb0EqmoO7GY4p MYSg== X-Gm-Message-State: AOAM532puuibH9Dgcs3XyNv+Zz25oijmLSw4VYXfTcl+gqiMnzz5pTeg RqL6nwua6bddDIOnScKxFRk= X-Google-Smtp-Source: ABdhPJx4gvq/YPThVyxfnKgLbIbI4e8h4BIKUi+wZp2zJyfFECGxDJYSCSwkbkD5EopxPmZFCPjoIw== X-Received: by 2002:a63:6c04:: with SMTP id 
h4mr31066719pgc.30.1641632202201; Sat, 08 Jan 2022 00:56:42 -0800 (PST) Received: from localhost.localdomain ([58.100.34.57]) by smtp.gmail.com with ESMTPSA id x25sm1240990pfu.113.2022.01.08.00.56.39 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sat, 08 Jan 2022 00:56:41 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v8 5/6] unpack-objects: unpack_non_delta_entry() read data in a stream Date: Sat, 8 Jan 2022 16:54:18 +0800 Message-Id: <20220108085419.79682-6-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.gc288e771b4.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a blob object, no matter how big it is. This implementation may consume all the memory and cause OOM. By implementing a zstream version of the input_stream interface, we can use a small fixed buffer for "unpack_non_delta_entry()". However, unpacking non-delta objects from a stream instead of from an entire buffer has a 10% performance penalty. $ hyperfine \ --setup \ 'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \ --prepare 'rm -rf dest.git && git init --bare dest.git' \ ...
Summary './git -C dest.git -c core.bigFileThreshold=512m unpack-objects Helped-by: Derrick Stolee Helped-by: Jiang Xin Signed-off-by: Han Xin --- builtin/unpack-objects.c | 71 ++++++++++++++++++++++++++++++++- t/t5329-unpack-large-objects.sh | 23 +++++++++-- 2 files changed, 90 insertions(+), 4 deletions(-) diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index c6d6c17072..e9ec2b349d 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -343,11 +343,80 @@ static void added_object(unsigned nr, enum object_type type, } } +struct input_zstream_data { + git_zstream *zstream; + unsigned char buf[8192]; + int status; +}; + +static const void *feed_input_zstream(struct input_stream *in_stream, + unsigned long *readlen) +{ + struct input_zstream_data *data = in_stream->data; + git_zstream *zstream = data->zstream; + void *in = fill(1); + + if (in_stream->is_finished) { + *readlen = 0; + return NULL; + } + + zstream->next_out = data->buf; + zstream->avail_out = sizeof(data->buf); + zstream->next_in = in; + zstream->avail_in = len; + + data->status = git_inflate(zstream, 0); + + in_stream->is_finished = data->status != Z_OK; + use(len - zstream->avail_in); + *readlen = sizeof(data->buf) - zstream->avail_out; + + return data->buf; +} + +static void write_stream_blob(unsigned nr, size_t size) +{ + git_zstream zstream = { 0 }; + struct input_zstream_data data = { 0 }; + struct input_stream in_stream = { + .read = feed_input_zstream, + .data = &data, + }; + + data.zstream = &zstream; + git_inflate_init(&zstream); + + if (stream_loose_object(&in_stream, size, &obj_list[nr].oid)) + die(_("failed to write object in stream")); + + if (data.status != Z_STREAM_END) + die(_("inflate returned (%d)"), data.status); + git_inflate_end(&zstream); + + if (strict) { + struct blob *blob = + lookup_blob(the_repository, &obj_list[nr].oid); + if (blob) + blob->object.flags |= FLAG_WRITTEN; + else + die(_("invalid blob object from stream")); + } + obj_list[nr].obj 
= NULL; +} + static void unpack_non_delta_entry(enum object_type type, unsigned long size, unsigned nr) { - void *buf = get_data(size); + void *buf; + + /* Write large blob in stream without allocating full buffer. */ + if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) { + write_stream_blob(nr, size); + return; + } + buf = get_data(size); if (buf) write_object(nr, type, buf, size); } diff --git a/t/t5329-unpack-large-objects.sh b/t/t5329-unpack-large-objects.sh index 39c7a62d94..6f3bfb3df7 100755 --- a/t/t5329-unpack-large-objects.sh +++ b/t/t5329-unpack-large-objects.sh @@ -9,7 +9,11 @@ test_description='git unpack-objects with large objects' prepare_dest () { test_when_finished "rm -rf dest.git" && - git init --bare dest.git + git init --bare dest.git && + if test -n "$1" + then + git -C dest.git config core.bigFileThreshold $1 + fi } assert_no_loose () { @@ -37,16 +41,29 @@ test_expect_success 'set memory limitation to 1MB' ' ' test_expect_success 'unpack-objects failed under memory limitation' ' - prepare_dest && + prepare_dest 2m && test_must_fail git -C dest.git unpack-objects err && grep "fatal: attempting to allocate" err ' test_expect_success 'unpack-objects works with memory limitation in dry-run mode' ' - prepare_dest && + prepare_dest 2m && git -C dest.git unpack-objects -n X-Patchwork-Id: 12707416 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EC8B5C433FE for ; Sat, 8 Jan 2022 08:56:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233992AbiAHI4s (ORCPT ); Sat, 8 Jan 2022 03:56:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36238 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233920AbiAHI4p (ORCPT ); Sat, 8 Jan 2022 03:56:45 -0500 Received: from 
mail-pf1-x429.google.com (mail-pf1-x429.google.com [IPv6:2607:f8b0:4864:20::429]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 41979C061574 for ; Sat, 8 Jan 2022 00:56:45 -0800 (PST) Received: by mail-pf1-x429.google.com with SMTP id p37so7263383pfh.4 for ; Sat, 08 Jan 2022 00:56:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=i6EaJY25k8TGr6OPOF//g4RelKv2TzOdaMRp6nlc9uI=; b=YlEhyGTFKzPNdiB56QsnjsEU681zXzsKy85GliN6ooLkpmRCT4AZO529BHb3IXYrVS IvRdVPilJlosiSIqSV35stEtQifsTwO1Hq5yxJs3Zzluza9v0lo0pqXb8rkB1mH2ruHd yHKdKxIfwnK+VlNFjyejnRUlxM+Tr64hEeVKsgwiEl81zCwfoBffXs5phROZTn4jPGzB /iyVTgl0xTqJV+F4L5gJeGDZSkyF+nVRn25KqOgYhEgf3s/K2SFiaqEAhZXdfHIVDXY5 8/ZF7AUChzmO5NDG+CJwmh9CH8d7GEWMNgv30dtAfJcyOdMGPlBmlbo7hZk84q+icyg1 THLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=i6EaJY25k8TGr6OPOF//g4RelKv2TzOdaMRp6nlc9uI=; b=25rQP3LgxSiWnPR6v15jR9bMpI4zd/fRKPuaZVu69kE8hHbjtRMd61ybdQvNP8uFHP /CDxhn3ZD1mEK7PDiA0W8W9PT1dKq9xwlNDkxIgt8qw9+Ecj8+59E+HQ0drqV5zrloMA pO6I3Na34Frs+EUoN7viqGPqCLzPz4BnZGYv9rcnjS1ujXKMU/mHDdifQvI23xn4c1OL s1HvqhVhIZxOppwpf4hYPCWYmRY4vWCFbjFRnBQqzMOCESpydMcLPz7L/K6r8byD4ZgG 3BaZcedYUmJaFYlPN/BNSYLsyIpMcB2o42cfXYEYNk3J7jdp2h9Njtck5a3llN8CGnzn K/3Q== X-Gm-Message-State: AOAM531YUwR9wZKyOY5OyVF2rVeRRfMpvzFQN0b8tnHIzyXi+Q2SLmfo tMefakTP2k5oTMAyXJa1Fko= X-Google-Smtp-Source: ABdhPJxO2F4iUXg1XgzS4fVFjp2o/mBk7P7C4lNMoiEfdOUgIXzIfqf/FHOtjEVe4LZyPa/yhsXSxw== X-Received: by 2002:a63:8c:: with SMTP id 134mr14232911pga.599.1641632204768; Sat, 08 Jan 2022 00:56:44 -0800 (PST) Received: from localhost.localdomain ([58.100.34.57]) by smtp.gmail.com with ESMTPSA id x25sm1240990pfu.113.2022.01.08.00.56.42 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 
bits=128/128); Sat, 08 Jan 2022 00:56:44 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v8 6/6] object-file API: add a format_object_header() function Date: Sat, 8 Jan 2022 16:54:19 +0800 Message-Id: <20220108085419.79682-7-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.gc288e771b4.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Ævar Arnfjörð Bjarmason Add a convenience function to wrap the xsnprintf() command that generates loose object headers. This code was copy/pasted in various parts of the codebase, let's define it in one place and re-use it from there. All except one caller of it had a valid "enum object_type" for us, it's only write_object_file_prepare() which might need to deal with "git hash-object --literally" and a potential garbage type. Let's have the primary API use an "enum object_type", and define an *_extended() function that can take an arbitrary "const char *" for the type. See [1] for the discussion that prompted this patch, i.e. new code in object-file.c that wanted to copy/paste the xsnprintf() invocation. 1. 
https://lore.kernel.org/git/211213.86bl1l9bfz.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason Signed-off-by: Han Xin --- builtin/index-pack.c | 3 +-- bulk-checkin.c | 4 ++-- cache.h | 21 +++++++++++++++++++++ http-push.c | 2 +- object-file.c | 16 ++++++++++++---- 5 files changed, 37 insertions(+), 9 deletions(-) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index c23d01de7d..8a6ce77940 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -449,8 +449,7 @@ static void *unpack_entry_data(off_t offset, unsigned long size, int hdrlen; if (!is_delta_type(type)) { - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, - type_name(type),(uintmax_t)size) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), type, size); the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, hdrlen); } else diff --git a/bulk-checkin.c b/bulk-checkin.c index 8785b2ac80..9e685f0f1a 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -220,8 +220,8 @@ static int deflate_to_pack(struct bulk_checkin_state *state, if (seekback == (off_t) -1) return error("cannot find the current offset"); - header_len = xsnprintf((char *)obuf, sizeof(obuf), "%s %" PRIuMAX, - type_name(type), (uintmax_t)size) + 1; + header_len = format_object_header((char *)obuf, sizeof(obuf), + type, size); the_hash_algo->init_fn(&ctx); the_hash_algo->update_fn(&ctx, obuf, header_len); diff --git a/cache.h b/cache.h index cfba463aa9..64071a8d80 100644 --- a/cache.h +++ b/cache.h @@ -1310,6 +1310,27 @@ enum unpack_loose_header_result unpack_loose_header(git_zstream *stream, unsigned long bufsiz, struct strbuf *hdrbuf); +/** + * format_object_header() is a thin wrapper around xsnprintf() that + * writes the initial "<type> <obj-len>" part of the loose object + * header. It returns the size that snprintf() returns + 1. + * + * The format_object_header_extended() function allows for writing a + * type_name that's not one of the "enum object_type" types.
This is + * used for "git hash-object --literally". Pass in OBJ_NONE as the + * type, and a non-NULL "type_str" to do that. + * + * format_object_header() is a convenience wrapper for + * format_object_header_extended(). + */ +int format_object_header_extended(char *str, size_t size, enum object_type type, + const char *type_str, size_t objsize); +static inline int format_object_header(char *str, size_t size, + enum object_type type, size_t objsize) +{ + return format_object_header_extended(str, size, type, NULL, objsize); +} + /** * parse_loose_header() parses the starting "<type> <len>\0" of an * object. If it doesn't follow that format -1 is returned. To check diff --git a/http-push.c b/http-push.c index 3309aaf004..f0c044dcf7 100644 --- a/http-push.c +++ b/http-push.c @@ -363,7 +363,7 @@ static void start_put(struct transfer_request *request) git_zstream stream; unpacked = read_object_file(&request->obj->oid, &type, &len); - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(type), (uintmax_t)len) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), type, len); /* Set it up */ git_deflate_init(&stream, zlib_compression_level); diff --git a/object-file.c b/object-file.c index a462a21629..d384ef2952 100644 --- a/object-file.c +++ b/object-file.c @@ -1006,6 +1006,14 @@ void *xmmap(void *start, size_t length, return ret; } +int format_object_header_extended(char *str, size_t size, enum object_type type, + const char *typestr, size_t objsize) +{ + const char *s = type == OBJ_NONE ? typestr : type_name(type); + + return xsnprintf(str, size, "%s %"PRIuMAX, s, (uintmax_t)objsize) + 1; +} + /* * With an in-core object data in "map", rehash it to make sure the * object name actually matches "oid" to detect object corruption.
@@ -1034,7 +1042,7 @@ int check_object_signature(struct repository *r, const struct object_id *oid, return -1; /* Generate the header */ - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(obj_type), (uintmax_t)size) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), obj_type, size); /* Sha1.. */ r->hash_algo->init_fn(&c); @@ -1734,7 +1742,7 @@ static void write_object_file_prepare(const struct git_hash_algo *algo, git_hash_ctx c; /* Generate the header */ - *hdrlen = xsnprintf(hdr, *hdrlen, "%s %"PRIuMAX , type, (uintmax_t)len)+1; + *hdrlen = format_object_header_extended(hdr, *hdrlen, OBJ_NONE, type, len); /* Sha1.. */ algo->init_fn(&c); @@ -2027,7 +2035,7 @@ int stream_loose_object(struct input_stream *in_stream, size_t len, /* Since oid is not determined, save tmp file to odb path. */ strbuf_addf(&filename, "%s/", get_object_directory()); - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, type_name(OBJ_BLOB), len) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), OBJ_BLOB, len); /* Common steps for write_loose_object and stream_loose_object to * start writing loose oject: @@ -2168,7 +2176,7 @@ int force_object_loose(const struct object_id *oid, time_t mtime) buf = read_object(the_repository, oid, &type, &len); if (!buf) return error(_("cannot read object for %s"), oid_to_hex(oid)); - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(type), (uintmax_t)len) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), type, len); ret = write_loose_object(oid, hdr, hdrlen, buf, len, mtime, 0); free(buf);
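
As a rough illustration of what the new helper computes, the standalone sketch below mimics format_object_header() with plain snprintf(). It is not the code from this series: sketch_format_object_header() is a hypothetical name, the type is passed as a string instead of an "enum object_type", and assert() stands in for the way xsnprintf() dies when the buffer is too small. The point it shows is that the returned length counts the terminating NUL, since the NUL is part of the loose object header that gets hashed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for format_object_header(): writes
 * "<type> <size>" into buf and returns the header length
 * including the terminating NUL.
 */
int sketch_format_object_header(char *buf, size_t bufsize,
				const char *type, size_t objsize)
{
	int n = snprintf(buf, bufsize, "%s %" PRIuMAX, type,
			 (uintmax_t)objsize);

	/* xsnprintf() would die on truncation; this sketch only asserts. */
	assert(n >= 0 && (size_t)n < bufsize);

	/* Count the NUL as well: it belongs to the header. */
	return n + 1;
}
```

Called with the type "blob" and a size of 1234, the sketch writes "blob 1234" and returns 10: nine visible characters plus the NUL.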