From patchwork Fri Dec 3 09:35:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12654551 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1B092C433EF for ; Fri, 3 Dec 2021 09:36:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351406AbhLCJje (ORCPT ); Fri, 3 Dec 2021 04:39:34 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42204 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238201AbhLCJjc (ORCPT ); Fri, 3 Dec 2021 04:39:32 -0500 Received: from mail-pj1-x1031.google.com (mail-pj1-x1031.google.com [IPv6:2607:f8b0:4864:20::1031]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B2E4BC06173E for ; Fri, 3 Dec 2021 01:36:08 -0800 (PST) Received: by mail-pj1-x1031.google.com with SMTP id gt5so1922480pjb.1 for ; Fri, 03 Dec 2021 01:36:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=S0fSyWvuSH1lEGO4/AwXmgO6qm/qkgZ+kkMxuH3kXWk=; b=ozEG5mT8CeLE6qBXD6lp1WR+yvVEBIhrR080b5bIA7MiStqnJIxNV13qAmJ8smsLOE YmdY6nKh7vyYqIqBXSHnUilS46PxOLsr0gxwT5ErpPtDUkEPXTsvx8w3YUEHp7AKdeT5 vD5VMebvX56PHYPGsxPY46eRhxsFZlK1fgoVRa3g1jGA18ttEKMKxJ5ZIrXWEHdY/hTl spleTUyPMXjR2iiI1u0FXwDf167b3Z1LBZdi52TBJ6L3Ok/pcqZVpUSqT5z05IqAlurA Rck0FY6yxz1Y3EfCRypBc9oRtWf9J5OQOV6SNx/Gh8vV9xGDZ6Y11yZyI1FfPMhnZRrY M+uw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=S0fSyWvuSH1lEGO4/AwXmgO6qm/qkgZ+kkMxuH3kXWk=; 
b=NMZ25YxXho0G40QhckqKibpll7khfwfUxW3NNbXT4Fb7Ux7HWthD5VZTYkg1e9L5RM pOp1gTR7ci4HHImw1G4Yfcs059eAtOiOV34+GkzaPvcV/8MDOVDW4OaF22EMosdwZP0S tzytxOobVhn790PiE3wl6Foi7PnPmsrrXrj0xbmI/o+eTtHap6HYJL2TAB8cXzniW5sU tPUkltSoQxDpkK45JMFp2Fkix736CEBoDqAG6H+bDeFaYERAVFEF0OVXPqxA5hmwnxM7 9UiAP50sRZ7/0Nu8ajDtHOQRhKt03peCg2rYu/TG/Z3AZ9UJ3wFKC6EE8d6ZlnPiUNTX RCHA== X-Gm-Message-State: AOAM5316mmsGPVuADKpgqtpGknylfQFXW5thBVCZZNcgkzOtXjNHeQ7q /JOoyCN/WvuW5G+H8T13ZrY= X-Google-Smtp-Source: ABdhPJxjg9wVGBj/d+p/RJbc5K+uKAnU9AI5UYcEYZKSTVO6dZ2GmbtsNB4brT94P+AIjTn8LDUFJg== X-Received: by 2002:a17:902:b615:b0:143:bbf0:aad0 with SMTP id b21-20020a170902b61500b00143bbf0aad0mr21565873pls.12.1638524168141; Fri, 03 Dec 2021 01:36:08 -0800 (PST) Received: from localhost.localdomain ([205.204.117.99]) by smtp.gmail.com with ESMTPSA id g9sm2708142pfj.160.2021.12.03.01.36.05 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Dec 2021 01:36:07 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v4 1/5] object-file: refactor write_loose_object() to read buffer from stream Date: Fri, 3 Dec 2021 17:35:26 +0800 Message-Id: <20211203093530.93589-2-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211122033220.32883-1-chiyutianyi@gmail.com> References: <20211122033220.32883-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a blob object, no matter how big it is. This implementation may consume all the memory and cause OOM. This can be improved by feeding data to "write_loose_object()" in a stream. The input stream is implemented as an interface. 
In the first step, we make a simple implementation, feeding the entire buffer in the "stream" to "write_loose_object()" as a refactor. Helped-by: Jiang Xin Signed-off-by: Han Xin --- object-file.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++---- object-store.h | 6 ++++++ 2 files changed, 55 insertions(+), 4 deletions(-) diff --git a/object-file.c b/object-file.c index eb972cdccd..82656f7428 100644 --- a/object-file.c +++ b/object-file.c @@ -1860,8 +1860,26 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename) return fd; } +struct simple_input_stream_data { + const void *buf; + unsigned long len; +}; + +static const void *feed_simple_input_stream(struct input_stream *in_stream, unsigned long *len) +{ + struct simple_input_stream_data *data = in_stream->data; + + if (data->len == 0) { + *len = 0; + return NULL; + } + *len = data->len; + data->len = 0; + return data->buf; +} + static int write_loose_object(const struct object_id *oid, char *hdr, - int hdrlen, const void *buf, unsigned long len, + int hdrlen, struct input_stream *in_stream, time_t mtime, unsigned flags) { int fd, ret; @@ -1871,6 +1889,8 @@ static int write_loose_object(const struct object_id *oid, char *hdr, struct object_id parano_oid; static struct strbuf tmp_file = STRBUF_INIT; static struct strbuf filename = STRBUF_INIT; + const void *buf; + unsigned long len; loose_object_path(the_repository, &filename, oid); @@ -1898,6 +1918,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr, the_hash_algo->update_fn(&c, hdr, hdrlen); /* Then the data itself.. 
*/ + buf = in_stream->read(in_stream, &len); stream.next_in = (void *)buf; stream.avail_in = len; do { @@ -1960,6 +1981,14 @@ int write_object_file_flags(const void *buf, unsigned long len, { char hdr[MAX_HEADER_LEN]; int hdrlen = sizeof(hdr); + struct input_stream in_stream = { + .read = feed_simple_input_stream, + .data = (void *)&(struct simple_input_stream_data) { + .buf = buf, + .len = len, + }, + .size = len, + }; /* Normally if we have it in the pack then we do not bother writing * it out into .git/objects/??/?{38} file. @@ -1968,7 +1997,7 @@ int write_object_file_flags(const void *buf, unsigned long len, &hdrlen); if (freshen_packed_object(oid) || freshen_loose_object(oid)) return 0; - return write_loose_object(oid, hdr, hdrlen, buf, len, 0, flags); + return write_loose_object(oid, hdr, hdrlen, &in_stream, 0, flags); } int hash_object_file_literally(const void *buf, unsigned long len, @@ -1977,6 +2006,14 @@ int hash_object_file_literally(const void *buf, unsigned long len, { char *header; int hdrlen, status = 0; + struct input_stream in_stream = { + .read = feed_simple_input_stream, + .data = (void *)&(struct simple_input_stream_data) { + .buf = buf, + .len = len, + }, + .size = len, + }; /* type string, SP, %lu of the length plus NUL must fit this */ hdrlen = strlen(type) + MAX_HEADER_LEN; @@ -1988,7 +2025,7 @@ int hash_object_file_literally(const void *buf, unsigned long len, goto cleanup; if (freshen_packed_object(oid) || freshen_loose_object(oid)) goto cleanup; - status = write_loose_object(oid, header, hdrlen, buf, len, 0, 0); + status = write_loose_object(oid, header, hdrlen, &in_stream, 0, 0); cleanup: free(header); @@ -2003,14 +2040,22 @@ int force_object_loose(const struct object_id *oid, time_t mtime) char hdr[MAX_HEADER_LEN]; int hdrlen; int ret; + struct simple_input_stream_data data; + struct input_stream in_stream = { + .read = feed_simple_input_stream, + .data = &data, + }; if (has_loose_object(oid)) return 0; buf = 
read_object(the_repository, oid, &type, &len); + in_stream.size = len; if (!buf) return error(_("cannot read object for %s"), oid_to_hex(oid)); + data.buf = buf; + data.len = len; hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(type), (uintmax_t)len) + 1; - ret = write_loose_object(oid, hdr, hdrlen, buf, len, mtime, 0); + ret = write_loose_object(oid, hdr, hdrlen, &in_stream, mtime, 0); free(buf); return ret; diff --git a/object-store.h b/object-store.h index 952efb6a4b..a84d891d60 100644 --- a/object-store.h +++ b/object-store.h @@ -34,6 +34,12 @@ struct object_directory { char *path; }; +struct input_stream { + const void *(*read)(struct input_stream *, unsigned long *len); + void *data; + size_t size; +}; + KHASH_INIT(odb_path_map, const char * /* key: odb_path */, struct object_directory *, 1, fspathhash, fspatheq) From patchwork Fri Dec 3 09:35:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12654553 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 29EC2C433FE for ; Fri, 3 Dec 2021 09:36:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351418AbhLCJjf (ORCPT ); Fri, 3 Dec 2021 04:39:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42220 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238201AbhLCJje (ORCPT ); Fri, 3 Dec 2021 04:39:34 -0500 Received: from mail-pg1-x52e.google.com (mail-pg1-x52e.google.com [IPv6:2607:f8b0:4864:20::52e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 510E1C06173E for ; Fri, 3 Dec 2021 01:36:11 -0800 (PST) Received: by mail-pg1-x52e.google.com with SMTP id 200so2503837pga.1 for ; Fri, 03 Dec 2021 01:36:11 -0800 (PST) DKIM-Signature: 
v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=skIAiRKZ183CeUNhWtFnoaHlnkxLZyLefNCqVeWgZjE=; b=XBTCfJdjyg0dWAhSPgaWc84ofhpsQka5iMNQMmwA7L8GM9iifjudAm1AKLeIkXlsJ0 hKYLG2y1cUfGq+Mg6OzWq8wV1TbDdEldnndVJnlfwvLU/tZfcrmvUkrywV3XGpcREZ1Q kccu8rgVhONqpVcmrnNaegIh/aOSv4FLyTv/NsE04RuNg1TPnP9w1glMOK/LgQUNI1AA FIW1IjHXZeBdoTdw+X0Sf/DKO07aUCEt6p5t5daezeVfnFkzXlsDEYcSodgqdSgJ+Jk1 F7wquw1QX+GyiJUyqDUglN8A2YJqabEZgCDdDxps6EIccdpfNszkCPTsJBsRSoHvlQzK 2OiQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=skIAiRKZ183CeUNhWtFnoaHlnkxLZyLefNCqVeWgZjE=; b=1RtcWT/E4JLKvp5hYtV7c/MOVkxioDcvXYesylblWPbDdeFkoGV1FYoUEPnL4FJsAu /n8AVWUX0cNosc7cU6pi99WkdO9p3c4xUwNtWdLEP8bpZwv2gaXcCwYr4tZZ3ug/fijI tyUM7Y1NvdQFtU7fqPzzyJdAXMy7sHVmLdZHeEdAx738zWxQ6vF/zD18zESJlp+WdG/h zTG3T/wcjHIAOKovMYmHiOE4sJm2Gk0ybyBB8s+xOPGxVZ6M0kBERhMjBvI0HJF6kxZ8 cnFS9O+/mT6LTufxAOWQZ55+OKeFEPvsIeTNjLNIg1Wa1Lei4ylY7Uk/Ryb2jQIdiF5Y 5nXA== X-Gm-Message-State: AOAM530/iQG1Aw2SMGj3/HIhnAcvGWR54xP2VfBM10RyYOiBYpuWkGeS ULWCxUpjC603xtbzzncGRsA= X-Google-Smtp-Source: ABdhPJwf53+22Oa1t3i4ORq2d4vb3rJSXHxQ1nTsUFEeCJ1VxIvQIdV1k/HhwMmaR0VwABZrUIHQjw== X-Received: by 2002:a63:8c0a:: with SMTP id m10mr3756358pgd.142.1638524170765; Fri, 03 Dec 2021 01:36:10 -0800 (PST) Received: from localhost.localdomain ([205.204.117.99]) by smtp.gmail.com with ESMTPSA id g9sm2708142pfj.160.2021.12.03.01.36.08 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Dec 2021 01:36:10 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v4 2/5] object-file.c: handle undetermined oid in write_loose_object() 
Date: Fri, 3 Dec 2021 17:35:27 +0800 Message-Id: <20211203093530.93589-3-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211122033220.32883-1-chiyutianyi@gmail.com> References: <20211122033220.32883-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin When streaming a large blob object to "write_loose_object()", we have no chance to run "write_object_file_prepare()" to calculate the oid in advance. So we need to handle undetermined oid in function "write_loose_object()". In the original implementation, we know the oid and we can write the temporary file in the same directory as the final object, but for an object with an undetermined oid, we don't know the exact directory for the object, so we have to save the temporary file in ".git/objects/" directory instead. Helped-by: Jiang Xin Signed-off-by: Han Xin --- object-file.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/object-file.c b/object-file.c index 82656f7428..1c41587bfb 100644 --- a/object-file.c +++ b/object-file.c @@ -1892,7 +1892,14 @@ static int write_loose_object(const struct object_id *oid, char *hdr, const void *buf; unsigned long len; - loose_object_path(the_repository, &filename, oid); + if (is_null_oid(oid)) { + /* When oid is not determined, save tmp file to odb path. 
*/ + strbuf_reset(&filename); + strbuf_addstr(&filename, the_repository->objects->odb->path); + strbuf_addch(&filename, '/'); + } else { + loose_object_path(the_repository, &filename, oid); + } fd = create_tmpfile(&tmp_file, filename.buf); if (fd < 0) { @@ -1939,12 +1946,31 @@ static int write_loose_object(const struct object_id *oid, char *hdr, die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid), ret); the_hash_algo->final_oid_fn(&parano_oid, &c); - if (!oideq(oid, &parano_oid)) + if (!is_null_oid(oid) && !oideq(oid, &parano_oid)) die(_("confused by unstable object source data for %s"), oid_to_hex(oid)); close_loose_object(fd); + if (is_null_oid(oid)) { + int dirlen; + + oidcpy((struct object_id *)oid, &parano_oid); + loose_object_path(the_repository, &filename, oid); + + /* We finally know the object path, and create the missing dir. */ + dirlen = directory_size(filename.buf); + if (dirlen) { + struct strbuf dir = STRBUF_INIT; + strbuf_add(&dir, filename.buf, dirlen - 1); + if (mkdir(dir.buf, 0777) && errno != EEXIST) + return -1; + if (adjust_shared_perm(dir.buf)) + return -1; + strbuf_release(&dir); + } + } + if (mtime) { struct utimbuf utb; utb.actime = mtime; From patchwork Fri Dec 3 09:35:28 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12654555 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5FD6AC433F5 for ; Fri, 3 Dec 2021 09:36:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351425AbhLCJji (ORCPT ); Fri, 3 Dec 2021 04:39:38 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42248 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351436AbhLCJjh (ORCPT ); Fri, 3 Dec 2021 04:39:37 -0500
Received: from mail-pj1-x1029.google.com (mail-pj1-x1029.google.com [IPv6:2607:f8b0:4864:20::1029]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2F650C061758 for ; Fri, 3 Dec 2021 01:36:13 -0800 (PST) Received: by mail-pj1-x1029.google.com with SMTP id gx15-20020a17090b124f00b001a695f3734aso2040816pjb.0 for ; Fri, 03 Dec 2021 01:36:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=gScyMy5WEdyKHwcbjQF64X523jAtZMtpE/tzSykJIf8=; b=UCTFbUtD0tKLMZ/YKFdTc6n/ESPx1v4BDOcjd7gcjv/XVeKx+0uO554XUPVvLPSdRd FRt8/vXDKrOAWFqYTH700gBj5DEEZnBMDiI2LYekhfT52eIcl18r9jhAqaLjMq5+oyPt aRUGPoA3gp52x9LQZv1IMh69OWiugtgt/4XOb5vkQ1aTgK2bWJngVI8mhTo2wmxM35L3 geRkRlB1ZvDZZVwfyGNlu4nhGQqXQZB0Uf9yrNeQNoPwQYprQDTjF4bGIWRw88VDW8Nc dHC/T0edW+fANh6xhW7Gh742gZ7hyiTpiGGlTrcdfrABhdPjejWXFqEEEeLzvakQZIjk jFPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=gScyMy5WEdyKHwcbjQF64X523jAtZMtpE/tzSykJIf8=; b=Rzucr7sxAy16R15/cpBZX78+IvLG8RhNNgDnD/sM3Bcd/Zae+1iMPAIpD+OaiKqbjA 0S8PDSYMU2YEpVCY/0VkxfjIX46bmnp/V1kyMch3JJ9UiEXaeI0PSKTwfZtLTgmRvl63 stJfXR4kMmuCRGZW1HqSOyk3yc7qsxvu8w+LktR2TAMuL7V4t/0DxeB8uz4QaPOBQ2Ly rjpsQTKSHI9tuVF4R7bZ+rxJT+rWTdh0AQv6ziMTrXFjcmzxf8z+CDainBvh37wvc7Ez fNrxgLXNJdj/nfKrPLUlmAGd4PUPe7H82UhhrMI1wRQZS0W7NdRpKP8g28JKVOipRUws wJ9g== X-Gm-Message-State: AOAM532beIGEg4iL2E0izJQNUbiqA1rWAgtfgn6HQqNVYpc7qPHwSBrt LQt68tJvu/zAKywLecdN+Fs= X-Google-Smtp-Source: ABdhPJy1aNHCHlM8oAzRPc5hSaNnmfs990MmM7SqP18nzHrxWB40BthpUXkzCZGjwx8I9viP9kIhvA== X-Received: by 2002:a17:902:e852:b0:142:19fe:982a with SMTP id t18-20020a170902e85200b0014219fe982amr22157645plg.13.1638524173404; Fri, 03 Dec 2021 01:36:13 -0800 (PST) Received: from localhost.localdomain ([205.204.117.99]) by smtp.gmail.com 
with ESMTPSA id g9sm2708142pfj.160.2021.12.03.01.36.11 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Dec 2021 01:36:12 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v4 3/5] object-file.c: read stream in a loop in write_loose_object() Date: Fri, 3 Dec 2021 17:35:28 +0800 Message-Id: <20211203093530.93589-4-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211122033220.32883-1-chiyutianyi@gmail.com> References: <20211122033220.32883-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin In order to prepare the stream version of "write_loose_object()", read the input stream in a loop in "write_loose_object()", so that we can feed the contents of large blob object to "write_loose_object()" using a small fixed buffer. Helped-by: Jiang Xin Signed-off-by: Han Xin --- object-file.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/object-file.c b/object-file.c index 1c41587bfb..fa54e39c2c 100644 --- a/object-file.c +++ b/object-file.c @@ -1890,7 +1890,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr, static struct strbuf tmp_file = STRBUF_INIT; static struct strbuf filename = STRBUF_INIT; const void *buf; - unsigned long len; + int flush = 0; if (is_null_oid(oid)) { /* When oid is not determined, save tmp file to odb path. */ @@ -1925,18 +1925,23 @@ static int write_loose_object(const struct object_id *oid, char *hdr, the_hash_algo->update_fn(&c, hdr, hdrlen); /* Then the data itself.. 
*/ - buf = in_stream->read(in_stream, &len); - stream.next_in = (void *)buf; - stream.avail_in = len; do { unsigned char *in0 = stream.next_in; - ret = git_deflate(&stream, Z_FINISH); + if (!stream.avail_in) { + buf = in_stream->read(in_stream, &stream.avail_in); + stream.next_in = (void *)buf; + in0 = (unsigned char *)buf; + /* All data has been read. */ + if (in_stream->size + hdrlen == stream.total_in + stream.avail_in) + flush = Z_FINISH; + } + ret = git_deflate(&stream, flush); the_hash_algo->update_fn(&c, in0, stream.next_in - in0); if (write_buffer(fd, compressed, stream.next_out - compressed) < 0) die(_("unable to write loose object file")); stream.next_out = compressed; stream.avail_out = sizeof(compressed); - } while (ret == Z_OK); + } while (ret == Z_OK || ret == Z_BUF_ERROR); if (ret != Z_STREAM_END) die(_("unable to deflate new object %s (%d)"), oid_to_hex(oid), From patchwork Fri Dec 3 09:35:29 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12654557 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CDF8FC433F5 for ; Fri, 3 Dec 2021 09:36:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351436AbhLCJjl (ORCPT ); Fri, 3 Dec 2021 04:39:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42258 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351430AbhLCJjk (ORCPT ); Fri, 3 Dec 2021 04:39:40 -0500 Received: from mail-pj1-x102c.google.com (mail-pj1-x102c.google.com [IPv6:2607:f8b0:4864:20::102c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 967ACC06173E for ; Fri, 3 Dec 2021 01:36:16 -0800 (PST) Received: by mail-pj1-x102c.google.com with SMTP id 
gb13-20020a17090b060d00b001a674e2c4a8so1985001pjb.4 for ; Fri, 03 Dec 2021 01:36:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=F5JudBKYEm4k2ej1j7f2t6+wlXtYbrLm19orpVsDdgw=; b=lTO5khGw6yg3WuQomUcernM/QxLBfKZaXvLy5AWu6loMePLCLycqbqTy+VRZxbHatw qQTU7Da6GXRIdgWMlmVXpkNkd1jTwZemcQbH/PzGJjJBGKu8htm4w9Y5ZSb0ZT7p7SRM QdY3ppAXIvLZ0glGh1DJNRfpX77ESJ4BEeZxFgoJUA+y6Ydit7JIhRJNBfLpWcyCkGAy uBnR5nH4vWnVrGrZKqlDIXSTWZZ13KVqoFuKB7aJwSZUx9WZFln4ZAi0FJ0XQAJFxNVv l5zNfwYYkdRaD1SyssJIJu5/QwWFJQ4bJ0vOrjARZ8rZ0pfWgrtmkPj68rMsGmQQHYGM w2ww== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=F5JudBKYEm4k2ej1j7f2t6+wlXtYbrLm19orpVsDdgw=; b=w7xfCv0qindE3CMLAOIOMRwblKyNf4mIrRX9FVg647No5+vpi56Xl9jUQczQ8dI7qG WJUrxU1G9IE5Y0Uzzfh6xJZfXzyEoYneZEmEmLzEFHfOvXiT/NPlQ9HXLhJN3ndBfMRa 4OXKiDsijF7sG/15+5SREx1WTM5YGSIQhlC5I5YYaQ5C1SHxG43AIEWc8ybv5HewcVNs 4S4WaOkuiNLhwv2q2AhXL0aQufwRD7XnQc4rw4NuBhOQ+R5HzBZgDogtq9z04CsUizkx Gq+O+Mrz5hoRNLxeFH586Mk6sj9WlqDHG4ZYFzCGh8YtXjM8gtEL0SgVoNWd+EMrZUX8 riFQ== X-Gm-Message-State: AOAM530oGGn2Z8uk7GbJ58QVTxUqMJN864zVOROrQPFEYAH6jPSge5a9 vLc2Cj4qOORf7WzYk2sigV0= X-Google-Smtp-Source: ABdhPJy4lhJe8z+3RuXLXknxl/vEweKnwUeOcWzroqFBCRaZsJ8EJwy4qFEAmrnbnD0AI3OwikGlHA== X-Received: by 2002:a17:902:cecf:b0:141:e15d:4a2a with SMTP id d15-20020a170902cecf00b00141e15d4a2amr21626048plg.66.1638524176205; Fri, 03 Dec 2021 01:36:16 -0800 (PST) Received: from localhost.localdomain ([205.204.117.99]) by smtp.gmail.com with ESMTPSA id g9sm2708142pfj.160.2021.12.03.01.36.13 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Dec 2021 01:36:15 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , 
=?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v4 4/5] unpack-objects.c: add dry_run mode for get_data() Date: Fri, 3 Dec 2021 17:35:29 +0800 Message-Id: <20211203093530.93589-5-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211122033220.32883-1-chiyutianyi@gmail.com> References: <20211122033220.32883-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin In dry_run mode, "get_data()" is used to verify the inflation of data, and the returned buffer will not be used at all and will be freed immediately. Even in dry_run mode, it is dangerous to allocate a full-size buffer for a large blob object. Therefore, only allocate a low memory footprint when calling "get_data()" in dry_run mode. Suggested-by: Jiang Xin Signed-off-by: Han Xin --- builtin/unpack-objects.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 4a9466295b..8d68acd662 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -96,15 +96,16 @@ static void use(int bytes) display_throughput(progress, consumed_bytes); } -static void *get_data(unsigned long size) +static void *get_data(unsigned long size, int dry_run) { git_zstream stream; - void *buf = xmallocz(size); + unsigned long bufsize = dry_run ? 
4096 : size; + void *buf = xmallocz(bufsize); memset(&stream, 0, sizeof(stream)); stream.next_out = buf; - stream.avail_out = size; + stream.avail_out = bufsize; stream.next_in = fill(1); stream.avail_in = len; git_inflate_init(&stream); @@ -124,6 +125,11 @@ static void *get_data(unsigned long size) } stream.next_in = fill(1); stream.avail_in = len; + if (dry_run) { + /* reuse the buffer in dry_run mode */ + stream.next_out = buf; + stream.avail_out = bufsize; + } } git_inflate_end(&stream); return buf; @@ -323,7 +329,7 @@ static void added_object(unsigned nr, enum object_type type, static void unpack_non_delta_entry(enum object_type type, unsigned long size, unsigned nr) { - void *buf = get_data(size); + void *buf = get_data(size, dry_run); if (!dry_run && buf) write_object(nr, type, buf, size); @@ -357,7 +363,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, if (type == OBJ_REF_DELTA) { oidread(&base_oid, fill(the_hash_algo->rawsz)); use(the_hash_algo->rawsz); - delta_data = get_data(delta_size); + delta_data = get_data(delta_size, dry_run); if (dry_run || !delta_data) { free(delta_data); return; @@ -396,7 +402,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, if (base_offset <= 0 || base_offset >= obj_list[nr].offset) die("offset value out of bound for delta base object"); - delta_data = get_data(delta_size); + delta_data = get_data(delta_size, dry_run); if (dry_run || !delta_data) { free(delta_data); return; From patchwork Fri Dec 3 09:35:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12654559 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CB6D5C433F5 for ; Fri, 3 Dec 2021 09:36:22 +0000 (UTC) Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351454AbhLCJjp (ORCPT ); Fri, 3 Dec 2021 04:39:45 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42290 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351442AbhLCJjn (ORCPT ); Fri, 3 Dec 2021 04:39:43 -0500 Received: from mail-pj1-x1032.google.com (mail-pj1-x1032.google.com [IPv6:2607:f8b0:4864:20::1032]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 92EBCC06174A for ; Fri, 3 Dec 2021 01:36:19 -0800 (PST) Received: by mail-pj1-x1032.google.com with SMTP id gx15-20020a17090b124f00b001a695f3734aso2041077pjb.0 for ; Fri, 03 Dec 2021 01:36:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=0dSsBDfdNiD6rL5vyx1WZfKIIRj9ckGU1BaVc94dGMA=; b=CZ1zePXrV152vtdTRNMmSE34lw/sRq1UgR/3oNxLLHMWyrncEXpqFX7QQSo24eh8tE 4JJ+HHNUVKSXPJEZ4nyyn5T/+yObsOqHb4xDd53mli1xcpwgQxieEl5ChJ4+IvsNUhd2 lCy+iUs0pUim9ipIjus5cDUpOO+am0uL8mz4IooI2BH7r2IhIQGLKz4UTMG3KwZaWMpc LPS5btJE5td0Plw6tgiWX0B42plyl16w2X8NXzJ1Pw2+vN9572zxLTohsfJUkoX2L69m 0044UyMZcuNkFvT9tfFcB8Ut6IqL28/oxvqV3FmLuzO6E749H3qPPW3cgxCBJ7M0q6U2 xwvA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=0dSsBDfdNiD6rL5vyx1WZfKIIRj9ckGU1BaVc94dGMA=; b=JIB+047F8NQS31fyIV8yo8ff/F9FaF2ykWciuiQu2qlcmF+l1JkRdjJV3ow1G8KOdE KULf8y6lgkW3zN+FX8DCcI/gVq6khMxY9KhV3wwzJW+9dip6lLKjDJClcnoTiYjQBLKh CFJHEgq2FCQTl1O0126ubyil0ncgxvww3BAWiOQl83TuGkCteF/zLBzPJTdh9UuYLKxO lI5v2O2tUVr98Fg8hzTa8KPJMtu5zOrxLlkoCkL/dug4Xzs964OSgWIq3BfvuZrYmVgr Djjpge0OSRU0RfQWqPn8q9SLZO3Ts7uOVlp8e/K/rCB+ZR+LOhF6DnJFYlcHeKySoTlT NRPw== X-Gm-Message-State: AOAM530t6gJsnsRKF3kAPtUGsnMVaA6R7JZDLjAhXXYvs0C88gFmHvRt UyvVCceqtcvboxnKnYhA3ozvR2rTxcFj+npe 
X-Google-Smtp-Source: ABdhPJwgxhBRwK9KKwkhNdy68edjr8475m/0QPwNpKxThj7rtn3R7H7JKMOFxshY2qTq5d2hZex0jg== X-Received: by 2002:a17:90b:17c4:: with SMTP id me4mr12727912pjb.15.1638524179052; Fri, 03 Dec 2021 01:36:19 -0800 (PST) Received: from localhost.localdomain ([205.204.117.99]) by smtp.gmail.com with ESMTPSA id g9sm2708142pfj.160.2021.12.03.01.36.16 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Dec 2021 01:36:18 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee Cc: Han Xin Subject: [PATCH v4 5/5] unpack-objects: unpack_non_delta_entry() read data in a stream Date: Fri, 3 Dec 2021 17:35:30 +0800 Message-Id: <20211203093530.93589-6-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211122033220.32883-1-chiyutianyi@gmail.com> References: <20211122033220.32883-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a blob object, no matter how big it is. This implementation may consume all the memory and cause OOM. By implementing a zstream version of the input_stream interface, we can use a small fixed buffer for "unpack_non_delta_entry()". However, unpacking non-delta objects from a stream instead of from an entire buffer has a 10% performance penalty. Therefore, only unpack objects larger than the "big_file_threshold" via zstream. See the following benchmarks: hyperfine \ --setup \ 'if !
test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \ --prepare 'rm -rf dest.git && git init --bare dest.git' \ -n 'old' 'git -C dest.git unpack-objects Helped-by: Jiang Xin Signed-off-by: Han Xin --- builtin/unpack-objects.c | 77 ++++++++++++++++++++++++++++- object-file.c | 6 +-- object-store.h | 4 ++ t/t5590-unpack-non-delta-objects.sh | 76 ++++++++++++++++++++++++++++ 4 files changed, 159 insertions(+), 4 deletions(-) create mode 100755 t/t5590-unpack-non-delta-objects.sh diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 8d68acd662..bedc494e2d 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -326,11 +326,86 @@ static void added_object(unsigned nr, enum object_type type, } } +struct input_zstream_data { + git_zstream *zstream; + unsigned char buf[8192]; + int status; +}; + +static const void *feed_input_zstream(struct input_stream *in_stream, unsigned long *readlen) +{ + struct input_zstream_data *data = in_stream->data; + git_zstream *zstream = data->zstream; + void *in = fill(1); + + if (!len || data->status == Z_STREAM_END) { + *readlen = 0; + return NULL; + } + + zstream->next_out = data->buf; + zstream->avail_out = sizeof(data->buf); + zstream->next_in = in; + zstream->avail_in = len; + + data->status = git_inflate(zstream, 0); + use(len - zstream->avail_in); + *readlen = sizeof(data->buf) - zstream->avail_out; + + return data->buf; +} + +static void write_stream_blob(unsigned nr, unsigned long size) +{ + char hdr[32]; + int hdrlen; + git_zstream zstream; + struct input_zstream_data data; + struct input_stream in_stream = { + .read = feed_input_zstream, + .data = &data, + .size = size, + }; + struct object_id *oid = &obj_list[nr].oid; + int ret; + + memset(&zstream, 0, sizeof(zstream)); + memset(&data, 0, sizeof(data)); + data.zstream = &zstream; + git_inflate_init(&zstream); + + /* Generate the header */ + hdrlen = xsnprintf(hdr, 
sizeof(hdr), "%s %"PRIuMAX, type_name(OBJ_BLOB), (uintmax_t)size) + 1; + + if ((ret = write_loose_object(oid, hdr, hdrlen, &in_stream, 0, 0))) + die(_("failed to write object in stream %d"), ret); + + if (zstream.total_out != size || data.status != Z_STREAM_END) + die(_("inflate returned %d"), data.status); + git_inflate_end(&zstream); + + if (strict && !dry_run) { + struct blob *blob = lookup_blob(the_repository, oid); + if (blob) + blob->object.flags |= FLAG_WRITTEN; + else + die("invalid blob object from stream"); + } + obj_list[nr].obj = NULL; +} + static void unpack_non_delta_entry(enum object_type type, unsigned long size, unsigned nr) { - void *buf = get_data(size, dry_run); + void *buf; + + /* Write large blob in stream without allocating full buffer. */ + if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) { + write_stream_blob(nr, size); + return; + } + buf = get_data(size, dry_run); if (!dry_run && buf) write_object(nr, type, buf, size); else diff --git a/object-file.c b/object-file.c index fa54e39c2c..71d510614b 100644 --- a/object-file.c +++ b/object-file.c @@ -1878,9 +1878,9 @@ static const void *feed_simple_input_stream(struct input_stream *in_stream, unsi return data->buf; } -static int write_loose_object(const struct object_id *oid, char *hdr, - int hdrlen, struct input_stream *in_stream, - time_t mtime, unsigned flags) +int write_loose_object(const struct object_id *oid, char *hdr, + int hdrlen, struct input_stream *in_stream, + time_t mtime, unsigned flags) { int fd, ret; unsigned char compressed[4096]; diff --git a/object-store.h b/object-store.h index a84d891d60..ac5b11ec16 100644 --- a/object-store.h +++ b/object-store.h @@ -229,6 +229,10 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf, unsigned long len, const char *type, struct object_id *oid); +int write_loose_object(const struct object_id *oid, char *hdr, + int hdrlen, struct input_stream *in_stream, + time_t mtime, unsigned flags); + int 
write_object_file_flags(const void *buf, unsigned long len, const char *type, struct object_id *oid, unsigned flags); diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh new file mode 100755 index 0000000000..01d950d119 --- /dev/null +++ b/t/t5590-unpack-non-delta-objects.sh @@ -0,0 +1,76 @@ +#!/bin/sh +# +# Copyright (c) 2021 Han Xin +# + +test_description='Test unpack-objects when receive pack' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./test-lib.sh + +test_expect_success "create commit with big blobs (1.5 MB)" ' + test-tool genrandom foo 1500000 >big-blob && + test_commit --append foo big-blob && + test-tool genrandom bar 1500000 >big-blob && + test_commit --append bar big-blob && + ( + cd .git && + find objects/?? -type f | sort + ) >expect && + PACK=$(echo main | git pack-objects --progress --revs test) +' + +test_expect_success 'setup GIT_ALLOC_LIMIT to 1MB' ' + GIT_ALLOC_LIMIT=1m && + export GIT_ALLOC_LIMIT +' + +test_expect_success 'prepare dest repository' ' + git init --bare dest.git && + git -C dest.git config core.bigFileThreshold 2m && + git -C dest.git config receive.unpacklimit 100 +' + +test_expect_success 'fail to unpack-objects: cannot allocate' ' + test_must_fail git -C dest.git unpack-objects err && + test_i18ngrep "fatal: attempting to allocate" err && + ( + cd dest.git && + find objects/?? -type f | sort + ) >actual && + ! 
test_cmp expect actual +' + +test_expect_success 'set a lower bigfile threshold' ' + git -C dest.git config core.bigFileThreshold 1m +' + +test_expect_success 'unpack big object in stream' ' + git -C dest.git unpack-objects actual && + test_cmp expect actual +' + +test_expect_success 'setup for unpack-objects dry-run test' ' + git init --bare unpack-test.git +' + +test_expect_success 'unpack-objects dry-run' ' + ( + cd unpack-test.git && + git unpack-objects -n <../test-$PACK.pack + ) && + ( + cd unpack-test.git && + find objects/ -type f + ) >actual && + test_must_be_empty actual +' + +test_done
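
Note on the callback contract used across the series: the "input_stream" interface from patch 1/5 hands out one chunk per "read" call, and patch 3/5 keeps calling it until the declared "size" has been consumed. The standalone sketch below tries to show that contract outside of git. Only "struct input_stream" is copied from the series; "chunked_input_data", "feed_chunked_input" and "drain_input_stream" are hypothetical names made up for illustration, and the consumer just copies bytes instead of deflating and hashing them as "write_loose_object()" does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same shape as the interface added to object-store.h in patch 1/5. */
struct input_stream {
	const void *(*read)(struct input_stream *, unsigned long *len);
	void *data;
	size_t size;
};

/* Hypothetical feeder state: hands out a buffer in fixed-size chunks. */
struct chunked_input_data {
	const unsigned char *buf;
	unsigned long remaining;
	unsigned long chunk;
};

/* Return the next chunk and its length, or NULL with *len == 0 at the end. */
static const void *feed_chunked_input(struct input_stream *in_stream,
				      unsigned long *len)
{
	struct chunked_input_data *data = in_stream->data;
	const unsigned char *p = data->buf;

	if (!data->remaining) {
		*len = 0;
		return NULL;
	}
	*len = data->remaining < data->chunk ? data->remaining : data->chunk;
	data->buf += *len;
	data->remaining -= *len;
	return p;
}

/*
 * Hypothetical consumer: copy the stream into "out" (which must hold
 * in_stream->size bytes) and return the number of bytes copied.
 * "calls" receives how many chunks were delivered.
 */
static size_t drain_input_stream(struct input_stream *in_stream,
				 unsigned char *out, size_t *calls)
{
	size_t total = 0;

	assert(in_stream && in_stream->read);
	*calls = 0;
	while (total < in_stream->size) {
		unsigned long len;
		const void *p = in_stream->read(in_stream, &len);

		if (!p || !len)
			break; /* stream ended before "size" bytes arrived */
		if (len > in_stream->size - total)
			break; /* feeder returned more than it declared */
		memcpy(out + total, p, len);
		total += len;
		(*calls)++;
	}
	return total;
}
```

A caller would fill a "struct chunked_input_data" with the source buffer and a chunk size, point "in_stream.data" at it, set "in_stream.size" to the total length, and compare the return value of "drain_input_stream()" against that size to detect a short stream.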