From patchwork Fri Dec 10 10:34:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12669201 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 54417C433F5 for ; Fri, 10 Dec 2021 10:35:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237701AbhLJKik (ORCPT ); Fri, 10 Dec 2021 05:38:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57906 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237729AbhLJKih (ORCPT ); Fri, 10 Dec 2021 05:38:37 -0500 Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com [IPv6:2607:f8b0:4864:20::530]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 51E11C061746 for ; Fri, 10 Dec 2021 02:35:02 -0800 (PST) Received: by mail-pg1-x530.google.com with SMTP id r5so7694486pgi.6 for ; Fri, 10 Dec 2021 02:35:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=+pYD1ko5Dr86OxJiFYWXfYMN9lM5KBG2alYRQXKafq8=; b=MVbUGiLYsNjaYGTJOLUxHe/fnWVUE6GxrbxQSb9adx9PrBNKYRV8FyPnAq8ABIssR/ GhWJKUD6U/9hZYrHQdIJAwCWb/9KUMQH18/W+l2fsPecHlehuq0xWJbBFupFwcGxTraS R/hnkwBwBsV6FeFKkZ2o0xCg0fIAMRZDmyeKlYrTRLfrqNGX93W8M9DXRjg38iUJOKfQ u8xBjB2FjXPfVrwLCtvZodA2zXGTE6x9+SE98MJCudjOu3BxylNhdRPFzD74KEFMSTGD lL82SveUBL5hirXgDnFchi/mscjLugFzdhxE6+aTedMomB7UM3+pme7XOT0wssvlPbvm RCUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=+pYD1ko5Dr86OxJiFYWXfYMN9lM5KBG2alYRQXKafq8=; 
b=7ZQ6kWIRWngearDUrzMN5g9uZyOl695Vsg/HAeJm51iLAh8xQtynVnJByXjx/bQAMg doFvsjvWVFUUu4S90+B57AVlUwdo/arlKan7Qr3K91OAjnuBkE347u3o7rkdDrYk9QrC E59qNs4xtz5A91EgCLXyR8ISA1axqSCDO4aC8suwGEXLWdwfmdgoeVoBrkUl5MSG8PZH D1MzmxZFuBMI/xP92s1oFYtTZ1RyWHiUbnh9Itftljlrz3cggpqqc/saPQVCD3I+GAni Sp+Pc5vHqrE9BSMB6H5Zgsr7urRGDyF0l7viO+VzQk11FIaOVuh8o8WCt8RWu1YkJx6V w5/Q== X-Gm-Message-State: AOAM532qVKWPu8hEKNgyi+LwWIDsEUgLtd81c8yydMAngAZaQYul/rjT +IShO12y/ZPa4v+AYM9DCgI= X-Google-Smtp-Source: ABdhPJxIunUehUEdkG5Gu/03z65X3GwR82hFuNqfPUJzulRKeKaPibAYuxcdIEDIRiKXeipYnxQxew== X-Received: by 2002:a63:5758:: with SMTP id h24mr39015611pgm.110.1639132501823; Fri, 10 Dec 2021 02:35:01 -0800 (PST) Received: from localhost.localdomain ([205.204.117.96]) by smtp.gmail.com with ESMTPSA id 204sm2396250pgb.63.2021.12.10.02.34.59 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 10 Dec 2021 02:35:01 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v5 1/6] object-file: refactor write_loose_object() to support read from stream Date: Fri, 10 Dec 2021 18:34:30 +0800 Message-Id: <20211210103435.83656-2-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211203093530.93589-1-chiyutianyi@gmail.com> References: <20211203093530.93589-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a blob object, no matter how big it is. This implementation may consume all the memory and cause OOM. This can be improved by feeding data to "write_loose_object()" in a stream. The input stream is implemented as an interface. 
In the first step, we add a new flag called "HASH_STREAM" and make a simple implementation, feeding the entire buffer in the stream to "write_loose_object()" as a refactor. Helped-by: Ævar Arnfjörð Bjarmason Helped-by: Jiang Xin Signed-off-by: Han Xin --- cache.h | 1 + object-file.c | 7 ++++++- object-store.h | 5 +++++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/cache.h b/cache.h index eba12487b9..51bd435dea 100644 --- a/cache.h +++ b/cache.h @@ -888,6 +888,7 @@ int ie_modified(struct index_state *, const struct cache_entry *, struct stat *, #define HASH_FORMAT_CHECK 2 #define HASH_RENORMALIZE 4 #define HASH_SILENT 8 +#define HASH_STREAM 16 int index_fd(struct index_state *istate, struct object_id *oid, int fd, struct stat *st, enum object_type type, const char *path, unsigned flags); int index_path(struct index_state *istate, struct object_id *oid, const char *path, struct stat *st, unsigned flags); diff --git a/object-file.c b/object-file.c index eb972cdccd..06375a90d6 100644 --- a/object-file.c +++ b/object-file.c @@ -1898,7 +1898,12 @@ static int write_loose_object(const struct object_id *oid, char *hdr, the_hash_algo->update_fn(&c, hdr, hdrlen); /* Then the data itself.. 
*/ - stream.next_in = (void *)buf; + if (flags & HASH_STREAM) { + struct input_stream *in_stream = (struct input_stream *)buf; + stream.next_in = (void *)in_stream->read(in_stream, &len); + } else { + stream.next_in = (void *)buf; + } stream.avail_in = len; do { unsigned char *in0 = stream.next_in; diff --git a/object-store.h b/object-store.h index 952efb6a4b..ccc1fc9c1a 100644 --- a/object-store.h +++ b/object-store.h @@ -34,6 +34,11 @@ struct object_directory { char *path; }; +struct input_stream { + const void *(*read)(struct input_stream *, unsigned long *len); + void *data; +}; + KHASH_INIT(odb_path_map, const char * /* key: odb_path */, struct object_directory *, 1, fspathhash, fspatheq) From patchwork Fri Dec 10 10:34:31 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12669203 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 67CCFC433EF for ; Fri, 10 Dec 2021 10:35:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237787AbhLJKil (ORCPT ); Fri, 10 Dec 2021 05:38:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57920 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237753AbhLJKij (ORCPT ); Fri, 10 Dec 2021 05:38:39 -0500 Received: from mail-pj1-x102b.google.com (mail-pj1-x102b.google.com [IPv6:2607:f8b0:4864:20::102b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A81CEC0617A1 for ; Fri, 10 Dec 2021 02:35:04 -0800 (PST) Received: by mail-pj1-x102b.google.com with SMTP id gx15-20020a17090b124f00b001a695f3734aso7208363pjb.0 for ; Fri, 10 Dec 2021 02:35:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; 
h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=8RIOZi6mNEfRi49dnVxnJHTjMXcQln8hiHrUNS8zAww=; b=MlyZSeNS8OHOh6KPERV4fDyYcNVhPHKjQye702jD71xwRkgs9UbcrdmM5e1MZpQyk3 H6dTPBX3TBBuZ+48BvU5b4MhCPvY8RtyCM1mIeCOmbDIr6oFy/vCpVghYVirAE/UMrpT 5FafccmLzEEZt9RYujWQdinHWFdd5r5/LtRq00oFoCMNh5MkRwMxlRR+YUMuj5n0X+N5 6KX9RPmnLSIQTEmDgd+d0B4gtLyZJ7+8uF+MKfj+SeIIJUfS/+dhb+Rmqqc9/kD6+t3n jFrjMvXIJXQH7JLwRYLhIYprHu+OEilL33+ue6bzG/2JmjtDqZH8ajORSzIyZXYRPyyU Salw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=8RIOZi6mNEfRi49dnVxnJHTjMXcQln8hiHrUNS8zAww=; b=hvTOYhErdKu7UwhrS9lQEcJT+wRVKhm9RqzZnCnyWWfcR0wPljHoJ6Qfd3fEUX5xto Lv7idS1rhGxLMS4HFoI3VOg5Yt5khBfdjlHQxrEBOLJJ3WGeCGU2TI+V9zWuyIsp6tmU KE/GKhzholICPZnj+J3M6a3u0JPrV94oOHkHIeu3UO1uW/MNmsliyRcrm2M257JH6rhM atdjVP6L1xhLw+y2HR1Wbc7sIsxzYhhjo35o6wXT01YS6r5mBjIVnziH4ozcv1yeNXQ2 YD5bDeGsNIOVCYL2fsqmkcd/+1FuvgAvoYR8ByNPAXLb2+f6ubLKJGIdRrp07GJhX0LG CGAg== X-Gm-Message-State: AOAM530OZr4V1FDNr057leojegJjk8GnMnBxcv75FBSYm1TPKPdaxRV+ 4IYbiyEZaIFYiDYH/jpezJU= X-Google-Smtp-Source: ABdhPJyztp9RPAe8OxRnm5yKKJCgrUoDjUComAK31KHXa9Ppj5aWTD9a1QGQK9js9A6DWPPrqbI3zQ== X-Received: by 2002:a17:90a:c297:: with SMTP id f23mr22903293pjt.138.1639132504251; Fri, 10 Dec 2021 02:35:04 -0800 (PST) Received: from localhost.localdomain ([205.204.117.96]) by smtp.gmail.com with ESMTPSA id 204sm2396250pgb.63.2021.12.10.02.35.02 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 10 Dec 2021 02:35:03 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v5 2/6] object-file.c: handle undetermined oid in write_loose_object() Date: Fri, 10 Dec 2021 18:34:31 +0800 Message-Id: 
<20211210103435.83656-3-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211203093530.93589-1-chiyutianyi@gmail.com> References: <20211203093530.93589-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin When streaming a large blob object to "write_loose_object()", we have no chance to run "write_object_file_prepare()" to calculate the oid in advance. So we need to handle undetermined oid in function "write_loose_object()". In the original implementation, we know the oid and we can write the temporary file in the same directory as the final object, but for an object with an undetermined oid, we don't know the exact directory for the object, so we have to save the temporary file in ".git/objects/" directory instead. The promise that "oid" is constant in "write_loose_object()" has been removed because it will be filled after reading all stream data. Helped-by: Ævar Arnfjörð Bjarmason Helped-by: Jiang Xin Signed-off-by: Han Xin --- object-file.c | 48 +++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 39 insertions(+), 9 deletions(-) diff --git a/object-file.c b/object-file.c index 06375a90d6..41099b137f 100644 --- a/object-file.c +++ b/object-file.c @@ -1860,11 +1860,11 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename) return fd; } -static int write_loose_object(const struct object_id *oid, char *hdr, +static int write_loose_object(struct object_id *oid, char *hdr, int hdrlen, const void *buf, unsigned long len, time_t mtime, unsigned flags) { - int fd, ret; + int fd, ret, err = 0; unsigned char compressed[4096]; git_zstream stream; git_hash_ctx c; @@ -1872,16 +1872,21 @@ static int write_loose_object(const struct object_id *oid, char *hdr, static struct strbuf tmp_file = STRBUF_INIT; static struct strbuf filename = STRBUF_INIT; - loose_object_path(the_repository, &filename, oid); + if (flags & HASH_STREAM) + /* When oid is not determined, save 
tmp file to odb path. */ + strbuf_addf(&filename, "%s/", get_object_directory()); + else + loose_object_path(the_repository, &filename, oid); fd = create_tmpfile(&tmp_file, filename.buf); if (fd < 0) { if (flags & HASH_SILENT) - return -1; + err = -1; else if (errno == EACCES) - return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory()); + err = error(_("insufficient permission for adding an object to repository database %s"), get_object_directory()); else - return error_errno(_("unable to create temporary file")); + err = error_errno(_("unable to create temporary file")); + goto cleanup; } /* Set it up */ @@ -1923,12 +1928,34 @@ static int write_loose_object(const struct object_id *oid, char *hdr, die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid), ret); the_hash_algo->final_oid_fn(&parano_oid, &c); - if (!oideq(oid, &parano_oid)) + if (!(flags & HASH_STREAM) && !oideq(oid, &parano_oid)) die(_("confused by unstable object source data for %s"), oid_to_hex(oid)); close_loose_object(fd); + if (flags & HASH_STREAM) { + int dirlen; + + oidcpy((struct object_id *)oid, &parano_oid); + loose_object_path(the_repository, &filename, oid); + + /* We finally know the object path, and create the missing dir. 
*/ + dirlen = directory_size(filename.buf); + if (dirlen) { + struct strbuf dir = STRBUF_INIT; + strbuf_add(&dir, filename.buf, dirlen - 1); + if (mkdir(dir.buf, 0777) && errno != EEXIST) + err = -1; + else if (adjust_shared_perm(dir.buf)) + err = -1; + else + strbuf_release(&dir); + if (err < 0) + goto cleanup; + } + } + if (mtime) { struct utimbuf utb; utb.actime = mtime; @@ -1938,7 +1965,10 @@ static int write_loose_object(const struct object_id *oid, char *hdr, warning_errno(_("failed utime() on %s"), tmp_file.buf); } - return finalize_object_file(tmp_file.buf, filename.buf); + err = finalize_object_file(tmp_file.buf, filename.buf); +cleanup: + strbuf_release(&filename); + return err; } static int freshen_loose_object(const struct object_id *oid) @@ -2015,7 +2045,7 @@ int force_object_loose(const struct object_id *oid, time_t mtime) if (!buf) return error(_("cannot read object for %s"), oid_to_hex(oid)); hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(type), (uintmax_t)len) + 1; - ret = write_loose_object(oid, hdr, hdrlen, buf, len, mtime, 0); + ret = write_loose_object((struct object_id*) oid, hdr, hdrlen, buf, len, mtime, 0); free(buf); return ret; From patchwork Fri Dec 10 10:34:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12669205 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD777C433F5 for ; Fri, 10 Dec 2021 10:35:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237990AbhLJKim (ORCPT ); Fri, 10 Dec 2021 05:38:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57930 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237753AbhLJKil (ORCPT ); Fri, 10 Dec 2021 05:38:41 
-0500 Received: from mail-pf1-x42e.google.com (mail-pf1-x42e.google.com [IPv6:2607:f8b0:4864:20::42e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 32395C061746 for ; Fri, 10 Dec 2021 02:35:07 -0800 (PST) Received: by mail-pf1-x42e.google.com with SMTP id x5so8180630pfr.0 for ; Fri, 10 Dec 2021 02:35:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=jU5XJfnHQH6Wp8fcR3m4XFy0zXzChs6IdGe7NS2r+5c=; b=cgo0y/qcW6mF0e6tE91AGrCyTAvzYEiRwU1j/H1IZnbT7wAuAp00Jr3LzjzQPHXQcO 4WhAgaBWYTOTFHasvIIliaS/bRyqFgaI2Jvf/VaUrWaAzcHVvOVosJyDyXTHW6rO1ijo xc3V9Y6VlQ/vcFowhbGvXxbprXxx26Lktx0lKaJp4/DQTBHOwrJtVkYk7rdWo2T29GKn I4wujCRlCCRRBIjX1KjS2LvsPlkOPBhG1c90wNbhQZ1Wyud/4mhfQ7fH0iFXWNqK+e6+ 19/xBN5qZONrFVSxTiH6amGctKW+qg67OuBmVL/vkow666CKHkjcLWZYeJu+Exmqwbt/ w/NQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=jU5XJfnHQH6Wp8fcR3m4XFy0zXzChs6IdGe7NS2r+5c=; b=VkqiYyteqofbJSF9q7bFB5tgtgKlSN4uPUu9/xIF1bS3RC2iTwKb5JDY/8YmKlD9Qv jU8BBO0W/QlTjhqJ8DO8Ez+p9S7A/AVcNdSOIUwbAkj0A8xcvKyiSvs845xZ3TZZj/99 nqc2QGUam/Ax6g8Jzz0yiD2/ISFnLAIdh7yaLnJKCYq8jI3StcJhCP8r5zPTmLMAbcVU tf4ZohVWaBXxSJzf8JJgO5g3OyDmseVVX9lmIzZXnS6ztYxj2bLrxTiRrM6MoeVjQnvj QTzyJvSrXZJPrNavx5GNJeoejz+ovE1MSeqskKS0dZ/WGCZoVdG6IwfvEJhh2RGaRncg c3Mg== X-Gm-Message-State: AOAM530sZL23QDskfpITajiSq3KI3S3a8m6/vQhiHlJ1vE8eZUh2xb7o v7EL1rv2Y4kxqMUoHH2Z6lOQrqj8uuzygQ== X-Google-Smtp-Source: ABdhPJziIzaklXVPglMRU9H4gBbTj251K5Xi0Yk0oGfk7Vwo+pi90IVD8Z6Pdz8VnNRbXivzdWEgaA== X-Received: by 2002:a63:8c0a:: with SMTP id m10mr39004118pgd.142.1639132506710; Fri, 10 Dec 2021 02:35:06 -0800 (PST) Received: from localhost.localdomain ([205.204.117.96]) by smtp.gmail.com with ESMTPSA id 204sm2396250pgb.63.2021.12.10.02.35.04 (version=TLS1_2 
cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 10 Dec 2021 02:35:06 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v5 3/6] object-file.c: read stream in a loop in write_loose_object() Date: Fri, 10 Dec 2021 18:34:32 +0800 Message-Id: <20211210103435.83656-4-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211203093530.93589-1-chiyutianyi@gmail.com> References: <20211203093530.93589-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin To prepare for the stream version of "write_loose_object()", read the input stream in a loop in "write_loose_object()", so that we can feed the contents of a large blob object to "write_loose_object()" using a small fixed buffer. Helped-by: Jiang Xin Signed-off-by: Han Xin --- object-file.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/object-file.c b/object-file.c index 41099b137f..455ab3c06e 100644 --- a/object-file.c +++ b/object-file.c @@ -1864,7 +1864,7 @@ static int write_loose_object(struct object_id *oid, char *hdr, int hdrlen, const void *buf, unsigned long len, time_t mtime, unsigned flags) { - int fd, ret, err = 0; + int fd, ret, err = 0, flush = 0; unsigned char compressed[4096]; git_zstream stream; git_hash_ctx c; @@ -1903,22 +1903,29 @@ static int write_loose_object(struct object_id *oid, char *hdr, the_hash_algo->update_fn(&c, hdr, hdrlen); /* Then the data itself.. 
*/ - if (flags & HASH_STREAM) { - struct input_stream *in_stream = (struct input_stream *)buf; - stream.next_in = (void *)in_stream->read(in_stream, &len); - } else { + if (!(flags & HASH_STREAM)) { stream.next_in = (void *)buf; + stream.avail_in = len; + flush = Z_FINISH; } - stream.avail_in = len; do { unsigned char *in0 = stream.next_in; - ret = git_deflate(&stream, Z_FINISH); + if (flags & HASH_STREAM && !stream.avail_in) { + struct input_stream *in_stream = (struct input_stream *)buf; + const void *in = in_stream->read(in_stream, &stream.avail_in); + stream.next_in = (void *)in; + in0 = (unsigned char *)in; + /* All data has been read. */ + if (len + hdrlen == stream.total_in + stream.avail_in) + flush = Z_FINISH; + } + ret = git_deflate(&stream, flush); the_hash_algo->update_fn(&c, in0, stream.next_in - in0); if (write_buffer(fd, compressed, stream.next_out - compressed) < 0) die(_("unable to write loose object file")); stream.next_out = compressed; stream.avail_out = sizeof(compressed); - } while (ret == Z_OK); + } while (ret == Z_OK || ret == Z_BUF_ERROR); if (ret != Z_STREAM_END) die(_("unable to deflate new object %s (%d)"), oid_to_hex(oid), From patchwork Fri Dec 10 10:34:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12669207 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0D5CBC433F5 for ; Fri, 10 Dec 2021 10:35:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237914AbhLJKip (ORCPT ); Fri, 10 Dec 2021 05:38:45 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57960 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237812AbhLJKio (ORCPT ); Fri, 10 Dec 2021 05:38:44 -0500 Received: 
from mail-pj1-x102d.google.com (mail-pj1-x102d.google.com [IPv6:2607:f8b0:4864:20::102d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BE61CC0617A1 for ; Fri, 10 Dec 2021 02:35:09 -0800 (PST) Received: by mail-pj1-x102d.google.com with SMTP id np6-20020a17090b4c4600b001a90b011e06so7139101pjb.5 for ; Fri, 10 Dec 2021 02:35:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=6CsBZVul8+xpJLD7ZM2E4q6o5VARdqprcFBptZ0+JBA=; b=m1yr/N3Uuk7EGf72NefoyQeXJzzJ/sgqPm9QZkVk4lZd+wOi8dCktEQWfYqGeu2Jjl BSQmGWcCfAzsYN3ZiaoyIN6bfviz775sSsl6gdiEvxxzecE5m96JmeEKfQ9FC8HOqbtL nelssZlc87+ZxknvhJMbj/0wqr5MvAVS5OtZhdvw925i1UAMTGYPfF9ERdPtz1FUz2ir xIewXn+1JPmI7gLvAitB6OSgVMkgNob64S3pUpB7b/9jm/3vobFYLp+eU0fL9n7RpFVb j4cPbi5mXSgVSik/C8EjgbNlGUZB2NZuAbThBxmeIimWt8gaXczP8yuebqUxs59FUWr0 QoMw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=6CsBZVul8+xpJLD7ZM2E4q6o5VARdqprcFBptZ0+JBA=; b=jClzDYSQ+DfG7WbLA7bfrDeBClV7zpc1hBoN7/ao10GFNXnowFyKPbSBZ0OBpYkvy3 HxQFMp9Rcg2D62bnXuTq7kUcSfxQfdBfk/0gqwlbb4whjWSrS241KDW75hNPfk9CstOx Ex5aBvcGIRRuDEsRGyO43uL0X2ipmsYdRS7O7ynqfXGhZs2Jc21H1xv2w7Xd2UeCYKKs dFhOvr1LUvVT78WuhRIONkyJujisNd6yrR3ym3bnbdyjzQTZAixSqbfRYVICgn7pnqtw GBNkoT1NdArKL+3W5zGE5f/mBPMaz0ZmrkHr83JT+3h2kyzxzJHFwtmAPLh0EkziZ4Pp rjTA== X-Gm-Message-State: AOAM532Z8RhIO0IR0WveUKmmDrf35y7Zc8egKKbEUYXixQ6FcE03ino3 jqqoyhU26TirPPg6AXFHJVI= X-Google-Smtp-Source: ABdhPJzI9J2FEkNSg3a+xLODJuw3aEmgk6PPDw+F9YBg47PVmEQUINx4XzBn1Qm8PeYPyNrX12tM9w== X-Received: by 2002:a17:902:b084:b0:141:f5f8:1c5a with SMTP id p4-20020a170902b08400b00141f5f81c5amr75052680plr.40.1639132509291; Fri, 10 Dec 2021 02:35:09 -0800 (PST) Received: from localhost.localdomain ([205.204.117.96]) by smtp.gmail.com with ESMTPSA 
id 204sm2396250pgb.63.2021.12.10.02.35.06 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 10 Dec 2021 02:35:08 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v5 4/6] unpack-objects.c: add dry_run mode for get_data() Date: Fri, 10 Dec 2021 18:34:33 +0800 Message-Id: <20211210103435.83656-5-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211203093530.93589-1-chiyutianyi@gmail.com> References: <20211203093530.93589-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin In dry_run mode, "get_data()" is used only to verify that the data inflates correctly; the returned buffer is never used and is freed immediately. Even in dry_run mode, it is dangerous to allocate a full-size buffer for a large blob object. Therefore, allocate only a small fixed-size buffer when calling "get_data()" in dry_run mode. Suggested-by: Jiang Xin Signed-off-by: Han Xin --- builtin/unpack-objects.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 4a9466295b..d878e2f8b4 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -96,15 +96,16 @@ static void use(int bytes) display_throughput(progress, consumed_bytes); } -static void *get_data(unsigned long size) +static void *get_data(unsigned long size, int dry_run) { git_zstream stream; - void *buf = xmallocz(size); + unsigned long bufsize = dry_run ? 
8192 : size; + void *buf = xmallocz(bufsize); memset(&stream, 0, sizeof(stream)); stream.next_out = buf; - stream.avail_out = size; + stream.avail_out = bufsize; stream.next_in = fill(1); stream.avail_in = len; git_inflate_init(&stream); @@ -124,6 +125,11 @@ static void *get_data(unsigned long size) } stream.next_in = fill(1); stream.avail_in = len; + if (dry_run) { + /* reuse the buffer in dry_run mode */ + stream.next_out = buf; + stream.avail_out = bufsize; + } } git_inflate_end(&stream); return buf; @@ -323,7 +329,7 @@ static void added_object(unsigned nr, enum object_type type, static void unpack_non_delta_entry(enum object_type type, unsigned long size, unsigned nr) { - void *buf = get_data(size); + void *buf = get_data(size, dry_run); if (!dry_run && buf) write_object(nr, type, buf, size); @@ -357,7 +363,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, if (type == OBJ_REF_DELTA) { oidread(&base_oid, fill(the_hash_algo->rawsz)); use(the_hash_algo->rawsz); - delta_data = get_data(delta_size); + delta_data = get_data(delta_size, dry_run); if (dry_run || !delta_data) { free(delta_data); return; @@ -396,7 +402,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, if (base_offset <= 0 || base_offset >= obj_list[nr].offset) die("offset value out of bound for delta base object"); - delta_data = get_data(delta_size); + delta_data = get_data(delta_size, dry_run); if (dry_run || !delta_data) { free(delta_data); return; From patchwork Fri Dec 10 10:34:34 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12669209 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9F2C8C433EF for ; Fri, 10 Dec 2021 10:35:14 +0000 (UTC) Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237952AbhLJKis (ORCPT ); Fri, 10 Dec 2021 05:38:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57982 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237945AbhLJKiq (ORCPT ); Fri, 10 Dec 2021 05:38:46 -0500 Received: from mail-pj1-x102b.google.com (mail-pj1-x102b.google.com [IPv6:2607:f8b0:4864:20::102b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 264D4C061A32 for ; Fri, 10 Dec 2021 02:35:12 -0800 (PST) Received: by mail-pj1-x102b.google.com with SMTP id y14-20020a17090a2b4e00b001a5824f4918so9112492pjc.4 for ; Fri, 10 Dec 2021 02:35:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=BcmKk2B73heq1ar1uDrlaD2DbDoGXCXY2XelAwTcYbE=; b=YWaKEg+fciKHOvUhdpw2cxrhR7p7vWNGNuHFrD7cxzs88ycC6LhskVmt2vBrzakM3a 4QzVVyDaGZcXmqYhemcn7Qn4DraTcXU0mZzvlhGh5o2f91weLvIg2iIhs149g4QTmtdX PXwFqMJeo8/FzvJ+FKGKaSxO0lARnNfYlvUDahVFhdTc1V+YD7ktq+yH6ADbwuy/6zPK 3mJpAw15QU7zxsk4enDg3xquJK9N9/91FLpZMBQs74HM+IGQ6vHSJh9HfWl4fA0JNwsL hkAIGkRyZmmLQNYv1qyC//ye02Zt1bD73ctC3I0N0+PKtJi6OGrzFC8s5DqZI5qyG7g3 zfCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=BcmKk2B73heq1ar1uDrlaD2DbDoGXCXY2XelAwTcYbE=; b=bjdNA+yLaNt/mQlQkt4cEdZMNrQE4w/Hycfv7g0YuBnmxts9iZmjUZGVUGoPAwFRtr pAAc54luJvqL3ZFNU5kq5x6qt63/PhA76Fj0djsZ/SAxmgicmGm/cCg00tYqbT95LM9x 51MAW7OoLcjohOEJOXGDzdVlyeDL+s59WDy9eWXjxfqITOdUBBcucT4PI8u1PZyzyK+B EbZrwRd5FBvp36aMY8OVLN1R3+JqEsYKqRlJoF77s7rJNT6ZqQTmgeke8TuW+pgobVJD DqA1CHM9gAPdbhFT9l8/K/OqphpY7CkAgOgzLaJsYSDuzDKdScTOLJW7h2jsdEOI7XQC 3OEA== X-Gm-Message-State: AOAM533hbBsrbc7iPvgZ5jLpEgTjZHHdVrLrL55jStNdPhKjobV/nwYI On4WWmi8aLU/pzEcAooYkR4= 
X-Google-Smtp-Source: ABdhPJwuAwXlok+mMrFwYHbEmcAzQrYFuCdBg4ZnRwqRVojXOQ+bdQngf60JOSY+H9HILlFJ/iNrDA== X-Received: by 2002:a17:902:c412:b0:141:f710:2a94 with SMTP id k18-20020a170902c41200b00141f7102a94mr75090104plk.1.1639132511704; Fri, 10 Dec 2021 02:35:11 -0800 (PST) Received: from localhost.localdomain ([205.204.117.96]) by smtp.gmail.com with ESMTPSA id 204sm2396250pgb.63.2021.12.10.02.35.09 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 10 Dec 2021 02:35:11 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v5 5/6] object-file.c: make "write_object_file_flags()" to support "HASH_STREAM" Date: Fri, 10 Dec 2021 18:34:34 +0800 Message-Id: <20211210103435.83656-6-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211203093530.93589-1-chiyutianyi@gmail.com> References: <20211203093530.93589-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We will use "write_object_file_flags()" in "unpack_non_delta_entry()" to read the entire contents of an object as a stream. When reading from a stream, we need not prepare "oid" before calling "write_loose_object()"; we only generate the header. Signed-off-by: Han Xin --- object-file.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/object-file.c b/object-file.c index 455ab3c06e..906590dae5 100644 --- a/object-file.c +++ b/object-file.c @@ -2002,6 +2002,11 @@ int write_object_file_flags(const void *buf, unsigned long len, { char hdr[MAX_HEADER_LEN]; int hdrlen = sizeof(hdr); + if (flags & HASH_STREAM) { + /* Generate the header */ + hdrlen = xsnprintf(hdr, hdrlen, "%s %"PRIuMAX , type, (uintmax_t)len)+1; + return write_loose_object(oid, hdr, hdrlen, buf, len, 0, flags); + } /* Normally if we have it in the pack then we do not bother writing * it out into .git/objects/??/?{38} file. 
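For illustration only: the header generated in the hunk above has the form "<type> <size>" followed by a NUL byte, which is why 1 is added to the formatted length. The following is a minimal sketch, not code from this series. It assumes plain snprintf() in place of git's xsnprintf(), and loose_header() is a hypothetical helper name that does not exist in git.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of git: format "<type> <size>" into hdr
 * and return the header length including the terminating NUL byte.
 * Plain snprintf() stands in for git's xsnprintf() here.
 */
int loose_header(char *hdr, size_t cap, const char *type, unsigned long size)
{
	int n = snprintf(hdr, cap, "%s %" PRIuMAX, type, (uintmax_t)size);

	/* Treat an encoding error or a truncated header as a programming error. */
	assert(n >= 0 && (size_t)n < cap);

	/* The NUL terminator is part of the header, so count it. */
	return n + 1;
}
```

With type "blob" and size 12, the buffer holds "blob 12" plus the NUL byte, so the returned length is 8.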
From patchwork Fri Dec 10 10:34:35 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12669211 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2C940C433F5 for ; Fri, 10 Dec 2021 10:35:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237845AbhLJKiv (ORCPT ); Fri, 10 Dec 2021 05:38:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57994 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237753AbhLJKit (ORCPT ); Fri, 10 Dec 2021 05:38:49 -0500 Received: from mail-pl1-x62d.google.com (mail-pl1-x62d.google.com [IPv6:2607:f8b0:4864:20::62d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B7600C0617A2 for ; Fri, 10 Dec 2021 02:35:14 -0800 (PST) Received: by mail-pl1-x62d.google.com with SMTP id m24so5975340pls.10 for ; Fri, 10 Dec 2021 02:35:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=PaJSC6y27DwFRXMb67RMHi84rf/+02Jm+aUFCi7CmuM=; b=BJUkamN4ri9KBkQDznZWAEQy06QRAdNFu7YaxdM1maT5fTcd1VUe+fyOusU2VCPR/3 EGepREHAwX+0tskXFC5uWQXD+61cc/XrL3IO/rA30I66GHF0A1K0v8Xp7110przmoA8Q 4uA5Q6LR0A/5i18WbKXc3xVTJ/xlnC4bINb+0VX221o8pPm2xb/RCeOkFTSBOtMvGNdC 0psiNylsh6t54Wa39L1EfDedYzNwE4caJXL/n/H1gp/JrtwPtZA0uqNRup4ENrDUIXao oRElxxHfotkKzMaYQ7BcYOqyiW5Z91cOMwMVjd4APYwZ2BskWvInjd7afFOH+V5o6S9X 7j+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=PaJSC6y27DwFRXMb67RMHi84rf/+02Jm+aUFCi7CmuM=; 
b=sLPFIPfaG4gnFHgoGR5pNMM6wPsoUCoKK1V5s47lIoRY3hktN7Xw77BhF62+CeY6c/ 3l5fNN5MpATuPMH25q2aAwFCjL2nXltHImcN1++DKKDxCsPrO1uElcgJXpuLzrGQr3bO g/DEmVAWl/ef58kNkj2DVIhFn38Qp5sbHfBtkKU5HO7C6CCcsVq5+dpAcEIxfDwQKIcT CboaAcnZ2eltb/ML8998HDf/IJVpf/J/yP1RtiaIP75YCqJFsLWbJT9/SXQ35mISrMjb Y/E7ypXML4cUGf+2YR28o2AixmX09t6Ll0hn6SUArkndDn/bbeDJNl7irudjsiETqr6G 2wwA== X-Gm-Message-State: AOAM5305fqhsAqWJg8P07JCgokHQKGPJGMvncS8Qx6PIRL54AwnV91Tx 28nE57zpgR7x+6XPZV4v4Ys= X-Google-Smtp-Source: ABdhPJwXxidNeLnwkVAkd6xSsE6e3eXTQUn5HKdCitS1Dt5GSlbMd1As1T4Rl8vDjlv5z7o+c6cz1w== X-Received: by 2002:a17:902:dad2:b0:141:fbea:178d with SMTP id q18-20020a170902dad200b00141fbea178dmr74661519plx.78.1639132514256; Fri, 10 Dec 2021 02:35:14 -0800 (PST) Received: from localhost.localdomain ([205.204.117.96]) by smtp.gmail.com with ESMTPSA id 204sm2396250pgb.63.2021.12.10.02.35.11 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 10 Dec 2021 02:35:13 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v5 6/6] unpack-objects: unpack_non_delta_entry() read data in a stream Date: Fri, 10 Dec 2021 18:34:35 +0800 Message-Id: <20211210103435.83656-7-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211203093530.93589-1-chiyutianyi@gmail.com> References: <20211203093530.93589-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a blob object, no matter how big it is. This implementation may consume all the memory and cause OOM. By implementing a zstream version of the input_stream interface, we can use a small fixed buffer for "unpack_non_delta_entry()". However, unpacking non-delta objects from a stream instead of from an entire buffer incurs a 10% performance penalty. 
Therefore, only unpack objects larger than
"core.bigFileStreamingThreshold" in zstream. See the following
benchmarks:

    hyperfine \
      --setup \
      'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
      --prepare 'rm -rf dest.git && git init --bare dest.git'

    Summary
      './git -C dest.git -c core.bigfilethreshold=512m unpack-objects

Helped-by: Derrick Stolee
Helped-by: Jiang Xin
Signed-off-by: Han Xin
---
 Documentation/config/core.txt       | 11 +++++
 builtin/unpack-objects.c            | 70 ++++++++++++++++++++++++++++-
 cache.h                             |  1 +
 config.c                            |  5 +++
 environment.c                       |  1 +
 t/t5590-unpack-non-delta-objects.sh | 70 +++++++++++++++++++++++++++++
 6 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100755 t/t5590-unpack-non-delta-objects.sh

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a..601b7a2418 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -424,6 +424,17 @@ be delta compressed, but larger binary media files won't be.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
+core.bigFileStreamingThreshold::
+	Files larger than this will be streamed out to a temporary
+	object file while being hashed, which will then be renamed
+	in-place to a loose object, particularly if the
+	`core.bigFileThreshold` setting dictates that they're always
+	written out as loose objects.
++
+Default is 128 MiB on all platforms.
++
+Common unit suffixes of 'k', 'm', or 'g' are supported.
+
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to
 	describe paths that are not meant to be tracked, in addition
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index d878e2f8b4..0df115ab0d 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -326,11 +326,79 @@ static void added_object(unsigned nr, enum object_type type,
 	}
 }
 
+struct input_zstream_data {
+	git_zstream *zstream;
+	unsigned char buf[8192];
+	int status;
+};
+
+static const void *feed_input_zstream(struct input_stream *in_stream, unsigned long *readlen)
+{
+	struct input_zstream_data *data = in_stream->data;
+	git_zstream *zstream = data->zstream;
+	void *in = fill(1);
+
+	if (!len || data->status == Z_STREAM_END) {
+		*readlen = 0;
+		return NULL;
+	}
+
+	zstream->next_out = data->buf;
+	zstream->avail_out = sizeof(data->buf);
+	zstream->next_in = in;
+	zstream->avail_in = len;
+
+	data->status = git_inflate(zstream, 0);
+	use(len - zstream->avail_in);
+	*readlen = sizeof(data->buf) - zstream->avail_out;
+
+	return data->buf;
+}
+
+static void write_stream_blob(unsigned nr, unsigned long size)
+{
+	git_zstream zstream;
+	struct input_zstream_data data;
+	struct input_stream in_stream = {
+		.read = feed_input_zstream,
+		.data = &data,
+	};
+	int ret;
+
+	memset(&zstream, 0, sizeof(zstream));
+	memset(&data, 0, sizeof(data));
+	data.zstream = &zstream;
+	git_inflate_init(&zstream);
+
+	if ((ret = write_object_file_flags(&in_stream, size, type_name(OBJ_BLOB), &obj_list[nr].oid, HASH_STREAM)))
+		die(_("failed to write object in stream %d"), ret);
+
+	if (zstream.total_out != size || data.status != Z_STREAM_END)
+		die(_("inflate returned %d"), data.status);
+	git_inflate_end(&zstream);
+
+	if (strict && !dry_run) {
+		struct blob *blob = lookup_blob(the_repository, &obj_list[nr].oid);
+		if (blob)
+			blob->object.flags |= FLAG_WRITTEN;
+		else
+			die("invalid blob object from stream");
+	}
+	obj_list[nr].obj = NULL;
+}
+
 static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 				   unsigned nr)
 {
-	void *buf = get_data(size, dry_run);
+	void *buf;
+
+	/* Write large blob in stream without allocating full buffer. */
+	if (!dry_run && type == OBJ_BLOB && size > big_file_streaming_threshold) {
+		write_stream_blob(nr, size);
+		return;
+	}
 
+	buf = get_data(size, dry_run);
 	if (!dry_run && buf)
 		write_object(nr, type, buf, size);
 	else
diff --git a/cache.h b/cache.h
index 51bd435dea..78548cd67a 100644
--- a/cache.h
+++ b/cache.h
@@ -965,6 +965,7 @@ extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
+extern unsigned long big_file_streaming_threshold;
 extern unsigned long pack_size_limit_cfg;
 
 /*
diff --git a/config.c b/config.c
index c5873f3a70..7b122a142a 100644
--- a/config.c
+++ b/config.c
@@ -1408,6 +1408,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.bigfilestreamingthreshold")) {
+		big_file_streaming_threshold = git_config_ulong(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.packedgitlimit")) {
 		packed_git_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 9da7f3c1a1..4fcc3de741 100644
--- a/environment.c
+++ b/environment.c
@@ -46,6 +46,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
 unsigned long big_file_threshold = 512 * 1024 * 1024;
+unsigned long big_file_streaming_threshold = 128 * 1024 * 1024;
 int pager_use_color = 1;
 const char *editor_program;
 const char *askpass_program;
diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh
new file mode 100755
index 0000000000..ff4c78900b
--- /dev/null
+++ b/t/t5590-unpack-non-delta-objects.sh
@@ -0,0 +1,70 @@
+#!/bin/sh
+#
+# Copyright (c) 2021 Han Xin
+#
+
+test_description='Test unpack-objects when receiving a pack'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+prepare_dest () {
+	test_when_finished "rm -rf dest.git" &&
+	git init --bare dest.git &&
+	git -C dest.git config core.bigFileStreamingThreshold $1 &&
+	git -C dest.git config core.bigFileThreshold $1
+}
+
+test_expect_success "setup repo with big blobs (1.5 MB)" '
+	test-tool genrandom foo 1500000 >big-blob &&
+	test_commit --append foo big-blob &&
+	test-tool genrandom bar 1500000 >big-blob &&
+	test_commit --append bar big-blob &&
+	(
+		cd .git &&
+		find objects/?? -type f | sort
+	) >expect &&
+	PACK=$(echo main | git pack-objects --revs test)
+'
+
+test_expect_success 'setup env: GIT_ALLOC_LIMIT to 1MB' '
+	GIT_ALLOC_LIMIT=1m &&
+	export GIT_ALLOC_LIMIT
+'
+
+test_expect_success 'fail to unpack-objects: cannot allocate' '
+	prepare_dest 2m &&
+	test_must_fail git -C dest.git unpack-objects err &&
+	grep "fatal: attempting to allocate" err &&
+	(
+		cd dest.git &&
+		find objects/?? -type f | sort
+	) >actual &&
+	test_file_not_empty actual &&
+	! test_cmp expect actual
+'
+
+test_expect_success 'unpack big object in stream' '
+	prepare_dest 1m &&
+	git -C dest.git unpack-objects actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'unpack-objects dry-run' '
+	prepare_dest 1m &&
+	git -C dest.git unpack-objects -n actual &&
+	test_must_be_empty actual
+'
+
+test_done
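
For readers following the streaming path in this patch, the idea can be
sketched standalone as below. This is only an illustration, not code
from the series. "struct input_stream" here is a guess at the interface
introduced earlier in the series (only its ".read" and ".data" members
are visible in this patch). "struct mem_source", "feed_mem()",
"drain_stream()" and "should_stream()" are hypothetical names, and a
plain in-memory source stands in for the zlib inflate step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed shape of the interface; the real definition is not in this patch. */
struct input_stream {
	const void *(*read)(struct input_stream *, unsigned long *len);
	void *data;
};

/* Hypothetical source: hands out a memory region through a fixed 8 KiB buffer. */
struct mem_source {
	const unsigned char *src;
	unsigned long remaining;
	unsigned char buf[8192];
};

/* Producer: returns the next chunk, or NULL with *len == 0 at end of input. */
static const void *feed_mem(struct input_stream *in, unsigned long *len)
{
	struct mem_source *m = in->data;
	unsigned long n = m->remaining < sizeof(m->buf) ?
			  m->remaining : sizeof(m->buf);

	if (!n) {
		*len = 0;
		return NULL;
	}
	memcpy(m->buf, m->src, n);
	m->src += n;
	m->remaining -= n;
	*len = n;
	return m->buf;
}

/*
 * Consumer: only ever sees one chunk at a time, so its memory use is
 * bounded by the producer's buffer, not by the object size. Returns the
 * total byte count; *sum and *max_chunk are filled in for inspection.
 */
static unsigned long drain_stream(struct input_stream *in,
				  unsigned long *sum,
				  unsigned long *max_chunk)
{
	unsigned long total = 0, len = 0, i;
	const unsigned char *p;

	assert(in && in->read);
	*sum = 0;
	*max_chunk = 0;
	while ((p = in->read(in, &len)) != NULL) {
		for (i = 0; i < len; i++)
			*sum += p[i];
		if (len > *max_chunk)
			*max_chunk = len;
		total += len;
	}
	return total;
}

/* Same condition as the new check in unpack_non_delta_entry(). */
static int should_stream(int is_blob, int dry_run,
			 unsigned long size, unsigned long threshold)
{
	return !dry_run && is_blob && size > threshold;
}
```

A caller would fill in a "struct mem_source", point an "input_stream" at
it with ".read = feed_mem", and hand that to "drain_stream()". Because
the comparison is strict, an object whose size equals the threshold
still takes the buffered get_data() path.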