From patchwork Mon Oct 23 22:44:56 2023
X-Patchwork-Submitter: Taylor Blau
X-Patchwork-Id: 13433669
Date: Mon, 23 Oct 2023 18:44:56 -0400
From: Taylor Blau
To: git@vger.kernel.org
Cc: Elijah Newren, "Eric W. Biederman", Jeff King, Junio C Hamano, Patrick Steinhardt
Subject: [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source`
Message-ID: <696aa027e46ddec310812fad2d4b12082447d925.1698101088.git.me@ttaylorr.com>

A future commit will want to implement a routine very similar to
`stream_blob_to_pack()` with two notable changes:

  - Instead of streaming just OBJ_BLOBs, this new function may want to
    stream objects of arbitrary type.

  - Instead of streaming the object's contents from an open
    file-descriptor, this new function may want to "stream" its
    contents from memory.

To avoid duplicating a significant chunk of code between the existing
`stream_blob_to_pack()` and this new routine, extract an abstract
`bulk_checkin_source`. This concept is currently a thin wrapper around
`lseek()` and `read_in_full()`, but will grow to understand how to
perform analogous operations when writing out an object's contents from
memory.

Suggested-by: Junio C Hamano
Signed-off-by: Taylor Blau
---
 bulk-checkin.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6ce62999e5..174a6c24e4 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -140,8 +140,49 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
 	return 0;
 }
 
+struct bulk_checkin_source {
+	off_t (*read)(struct bulk_checkin_source *, void *, size_t);
+	off_t (*seek)(struct bulk_checkin_source *, off_t);
+
+	union {
+		struct {
+			int fd;
+		} from_fd;
+	} data;
+
+	size_t size;
+	const char *path;
+};
+
+static off_t bulk_checkin_source_read_from_fd(struct bulk_checkin_source *source,
+					      void *buf, size_t nr)
+{
+	return read_in_full(source->data.from_fd.fd, buf, nr);
+}
+
+static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source,
+					      off_t offset)
+{
+	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
+}
+
+static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
+					     int fd, size_t size,
+					     const char *path)
+{
+	memset(source, 0, sizeof(struct bulk_checkin_source));
+
+	source->read = bulk_checkin_source_read_from_fd;
+	source->seek = bulk_checkin_source_seek_from_fd;
+
+	source->data.from_fd.fd = fd;
+
+	source->size = size;
+	source->path = path;
+}
+
 /*
- * Read the contents from fd for size bytes, streaming it to the
+ * Read the contents from 'source' for 'size' bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
  * by returning a negative value when the resulting pack would exceed
  * the pack size limit and this is not the first object in the pack,
@@ -157,7 +198,7 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
  */
 static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 			       git_hash_ctx *ctx, off_t *already_hashed_to,
-			       int fd, size_t size, const char *path,
+			       struct bulk_checkin_source *source,
 			       unsigned flags)
 {
 	git_zstream s;
@@ -167,22 +208,27 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 	int status = Z_OK;
 	int write_object = (flags & HASH_WRITE_OBJECT);
 	off_t offset = 0;
+	size_t size = source->size;
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB, size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
+					      size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
 	while (status != Z_STREAM_END) {
 		if (size && !s.avail_in) {
 			ssize_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
-			ssize_t read_result = read_in_full(fd, ibuf, rsize);
+			ssize_t read_result;
+
+			read_result = source->read(source, ibuf, rsize);
 			if (read_result < 0)
-				die_errno("failed to read from '%s'", path);
+				die_errno("failed to read from '%s'",
+					  source->path);
 			if (read_result != rsize)
 				die("failed to read %d bytes from '%s'",
-				    (int)rsize, path);
+				    (int)rsize, source->path);
 			offset += rsize;
 			if (*already_hashed_to < offset) {
 				size_t hsize = offset - *already_hashed_to;
@@ -258,6 +304,9 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
+	struct bulk_checkin_source source;
+
+	init_bulk_checkin_source_from_fd(&source, fd, size, path);
 
 	seekback = lseek(fd, 0, SEEK_CUR);
 	if (seekback == (off_t) -1)
@@ -283,7 +332,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			crc32_begin(state->f);
 		}
 		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 fd, size, path, flags))
+					 &source, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -295,7 +344,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_bulk_checkin_packfile(state);
-		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+		if (source.seek(&source, seekback) == (off_t)-1)
 			return error("cannot seek back");
 	}
 	the_hash_algo->final_oid_fn(result_oid, &ctx);
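
As an aside for readers skimming the diff: the new struct amounts to a
two-entry vtable over the file descriptor. The following is an
illustrative sketch (not part of the patch) of how a caller would drive
it, with `fd`, `size`, and `path` standing in for values that
`deflate_blob_to_pack()` already has on hand:

	struct bulk_checkin_source source;
	unsigned char ibuf[16384];
	off_t nread;

	init_bulk_checkin_source_from_fd(&source, fd, size, path);

	/* delegates to read_in_full() on the underlying descriptor */
	nread = source.read(&source, ibuf, sizeof(ibuf));
	if (nread < 0)
		die_errno("failed to read from '%s'", source.path);

	/* delegates to lseek(fd, 0, SEEK_SET) */
	if (source.seek(&source, 0) == (off_t)-1)
		die("cannot seek '%s'", source.path);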
From patchwork Mon Oct 23 22:44:59 2023
X-Patchwork-Submitter: Taylor Blau
X-Patchwork-Id: 13433670
Date: Mon, 23 Oct 2023 18:44:59 -0400
From: Taylor Blau
To: git@vger.kernel.org
Cc: Elijah Newren, "Eric W. Biederman", Jeff King, Junio C Hamano, Patrick Steinhardt
Subject: [PATCH v5 2/5] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
Message-ID: <596bd028a74f45c8f7ecf46dc5eb25f45ff5f523.1698101088.git.me@ttaylorr.com>

The existing `stream_blob_to_pack()` function is named based on the
fact that it knows only how to stream blobs into a bulk-checkin pack.

But there is no longer anything in this function which prevents us from
writing objects of arbitrary types to the bulk-checkin pack. Prepare to
write OBJ_TREEs by removing this assumption, adding an `enum
object_type` parameter to this function's argument list, and renaming
it to `stream_obj_to_pack()` accordingly.

Signed-off-by: Taylor Blau
---
 bulk-checkin.c | 61 +++++++++++++++++++++++++++++---------------------
 1 file changed, 36 insertions(+), 25 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 174a6c24e4..79776e679e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -196,10 +196,10 @@ static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
  * status before calling us just in case we ask it to call us again
  * with a new pack.
*/ -static int stream_blob_to_pack(struct bulk_checkin_packfile *state, - git_hash_ctx *ctx, off_t *already_hashed_to, - struct bulk_checkin_source *source, - unsigned flags) +static int stream_obj_to_pack(struct bulk_checkin_packfile *state, + git_hash_ctx *ctx, off_t *already_hashed_to, + struct bulk_checkin_source *source, + enum object_type type, unsigned flags) { git_zstream s; unsigned char ibuf[16384]; @@ -212,8 +212,7 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state, git_deflate_init(&s, pack_compression_level); - hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB, - size); + hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size); s.next_out = obuf + hdrlen; s.avail_out = sizeof(obuf) - hdrlen; @@ -293,27 +292,23 @@ static void prepare_to_stream(struct bulk_checkin_packfile *state, die_errno("unable to write pack header"); } -static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, - struct object_id *result_oid, - int fd, size_t size, - const char *path, unsigned flags) + +static int deflate_obj_to_pack(struct bulk_checkin_packfile *state, + struct object_id *result_oid, + struct bulk_checkin_source *source, + enum object_type type, + off_t seekback, + unsigned flags) { - off_t seekback, already_hashed_to; + off_t already_hashed_to = 0; git_hash_ctx ctx; unsigned char obuf[16384]; unsigned header_len; struct hashfile_checkpoint checkpoint = {0}; struct pack_idx_entry *idx = NULL; - struct bulk_checkin_source source; - init_bulk_checkin_source_from_fd(&source, fd, size, path); - - seekback = lseek(fd, 0, SEEK_CUR); - if (seekback == (off_t) -1) - return error("cannot find the current offset"); - - header_len = format_object_header((char *)obuf, sizeof(obuf), - OBJ_BLOB, size); + header_len = format_object_header((char *)obuf, sizeof(obuf), type, + source->size); the_hash_algo->init_fn(&ctx); the_hash_algo->update_fn(&ctx, obuf, header_len); the_hash_algo->init_fn(&checkpoint.ctx); @@ -322,8 +317,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, if ((flags & HASH_WRITE_OBJECT) != 0) CALLOC_ARRAY(idx, 1); - already_hashed_to = 0; - while (1) { prepare_to_stream(state, flags); if (idx) { @@ -331,8 +324,8 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, idx->offset = state->offset; crc32_begin(state->f); } - if (!stream_blob_to_pack(state, &ctx, &already_hashed_to, - &source, flags)) + if (!stream_obj_to_pack(state, &ctx, &already_hashed_to, + source, type, flags)) break; /* * Writing this object to the current pack will make @@ -344,7 +337,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, hashfile_truncate(state->f, &checkpoint); state->offset = checkpoint.offset; flush_bulk_checkin_packfile(state); - if (source.seek(&source, seekback) == (off_t)-1) + if (source->seek(source, seekback) == (off_t)-1) return error("cannot seek back"); } the_hash_algo->final_oid_fn(result_oid, &ctx); @@ -366,6 +359,24 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, return 0; } +static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, + struct object_id *result_oid, + int fd, size_t size, + const char *path, unsigned flags) +{ + struct bulk_checkin_source source; + off_t seekback; + + init_bulk_checkin_source_from_fd(&source, fd, size, path); + + seekback = lseek(fd, 0, SEEK_CUR); + if (seekback == (off_t) -1) + return error("cannot find the current offset"); + + return deflate_obj_to_pack(state, result_oid, &source, OBJ_BLOB, + 
seekback, flags);
+}
+
 void prepare_loose_object_bulk_checkin(void)
 {
 	/*
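
The effect of the rename is easiest to see at the call sites: after
this patch, writing some other object type through the same code path
is only a matter of passing a different `enum object_type`. For
illustration only (the tree variant does not exist until later in this
series), the two calls would differ in a single argument:

	/* blob path, as rewritten by this patch */
	if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
				source, OBJ_BLOB, flags))
		break;

	/* a future tree path would be identical apart from the type */
	if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
				source, OBJ_TREE, flags))
		break;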
From patchwork Mon Oct 23 22:45:01 2023
X-Patchwork-Submitter: Taylor Blau
X-Patchwork-Id: 13433671
Date: Mon, 23 Oct 2023 18:45:01 -0400
From: Taylor Blau
To: git@vger.kernel.org
Cc: Elijah Newren, "Eric W. Biederman", Jeff King, Junio C Hamano, Patrick Steinhardt
Subject: [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`

Introduce `index_blob_bulk_checkin_incore()`, which allows streaming
arbitrary blob contents from memory into the bulk-checkin pack.

In order to support streaming from a location in memory, we must
implement a new kind of bulk_checkin_source that does just that. This
implementation is spread out across:

  - init_bulk_checkin_source_incore()
  - bulk_checkin_source_read_incore()
  - bulk_checkin_source_seek_incore()

Note that, unlike file descriptors, which manage their own offset
internally, we have to keep track of how many bytes we've read out of
the buffer, and make sure we don't read past the end of the buffer.

This will be useful in a couple more commits in order to provide the
`merge-tree` builtin with a mechanism to create a new pack containing
any objects it created during the merge, instead of storing those
objects individually as loose.

Similar to the existing `index_blob_bulk_checkin()` function, the new
entry point delegates to `deflate_obj_to_pack_incore()`. That function
in turn delegates to deflate_obj_to_pack(), which is responsible for
formatting the pack header and then deflating the contents into the
pack.

Consistent with the rest of the bulk-checkin mechanism, there are no
direct tests here. In future commits when we expose this new
functionality via the `merge-tree` builtin, we will test it indirectly
there.

Signed-off-by: Taylor Blau
---
 bulk-checkin.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h |  4 +++
 2 files changed, 79 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 79776e679e..b728210bc7 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -148,6 +148,10 @@ struct bulk_checkin_source {
 		struct {
 			int fd;
 		} from_fd;
+		struct {
+			const void *buf;
+			size_t nr_read;
+		} incore;
 	} data;
 
 	size_t size;
@@ -166,6 +170,36 @@ static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source
 	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
 }
 
+static off_t bulk_checkin_source_read_incore(struct bulk_checkin_source *source,
+					     void *buf, size_t nr)
+{
+	const unsigned char *src = source->data.incore.buf;
+
+	if (source->data.incore.nr_read > source->size)
+		BUG("read beyond bulk-checkin source buffer end "
+		    "(%"PRIuMAX" > %"PRIuMAX")",
+		    (uintmax_t)source->data.incore.nr_read,
+		    (uintmax_t)source->size);
+
+	if (nr > source->size - source->data.incore.nr_read)
+		nr = source->size - source->data.incore.nr_read;
+
+	src += source->data.incore.nr_read;
+
+	memcpy(buf, src, nr);
+	source->data.incore.nr_read += nr;
+	return nr;
+}
+
+static off_t bulk_checkin_source_seek_incore(struct bulk_checkin_source *source,
+					     off_t offset)
+{
+	if (!(0 <= offset && offset < source->size))
+		return (off_t)-1;
+	source->data.incore.nr_read = offset;
+	return source->data.incore.nr_read;
+}
+
 static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
 					     int fd, size_t size,
 					     const char *path)
@@ -181,6 +215,22 @@ static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
 	source->path = path;
 }
 
+static void init_bulk_checkin_source_incore(struct bulk_checkin_source *source,
+					    const void *buf, size_t size,
+					    const char *path)
+{
+	memset(source, 0, sizeof(struct bulk_checkin_source));
+ + source->read = bulk_checkin_source_read_incore; + source->seek = bulk_checkin_source_seek_incore; + + source->data.incore.buf = buf; + source->data.incore.nr_read = 0; + + source->size = size; + source->path = path; +} + /* * Read the contents from 'source' for 'size' bytes, streaming it to the * packfile in state while updating the hash in ctx. Signal a failure @@ -359,6 +409,19 @@ static int deflate_obj_to_pack(struct bulk_checkin_packfile *state, return 0; } +static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state, + struct object_id *result_oid, + const void *buf, size_t size, + const char *path, enum object_type type, + unsigned flags) +{ + struct bulk_checkin_source source; + + init_bulk_checkin_source_incore(&source, buf, size, path); + + return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags); +} + static int deflate_blob_to_pack(struct bulk_checkin_packfile *state, struct object_id *result_oid, int fd, size_t size, @@ -421,6 +484,18 @@ int index_blob_bulk_checkin(struct object_id *oid, return status; } +int index_blob_bulk_checkin_incore(struct object_id *oid, + const void *buf, size_t size, + const char *path, unsigned flags) +{ + int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid, + buf, size, path, OBJ_BLOB, + flags); + if (!odb_transaction_nesting) + flush_bulk_checkin_packfile(&bulk_checkin_packfile); + return status; +} + void begin_odb_transaction(void) { odb_transaction_nesting += 1; diff --git a/bulk-checkin.h b/bulk-checkin.h index aa7286a7b3..1b91daeaee 100644 --- a/bulk-checkin.h +++ b/bulk-checkin.h @@ -13,6 +13,10 @@ int index_blob_bulk_checkin(struct object_id *oid, int fd, size_t size, const char *path, unsigned flags); +int index_blob_bulk_checkin_incore(struct object_id *oid, + const void *buf, size_t size, + const char *path, unsigned flags); + /* * Tell the object database to optimize for adding * multiple objects. 
end_odb_transaction must be called
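
For illustration, a minimal sketch (not part of the patch) of how a
caller is expected to use the new entry point; the buffer contents and
path below are placeholders, and `HASH_WRITE_OBJECT` is the same flag
the existing fd-based path already takes:

	struct object_id oid;
	struct strbuf buf = STRBUF_INIT;

	strbuf_addstr(&buf, "hello, world\n");

	begin_odb_transaction();
	if (index_blob_bulk_checkin_incore(&oid, buf.buf, buf.len,
					   "hello.txt", HASH_WRITE_OBJECT))
		die("unable to add blob to the bulk-checkin pack");
	/* flushes the bulk-checkin pack once nesting drops back to zero */
	end_odb_transaction();

	strbuf_release(&buf);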
From patchwork Mon Oct 23 22:45:04 2023
X-Patchwork-Submitter: Taylor Blau
X-Patchwork-Id: 13433672
Date: Mon, 23 Oct 2023 18:45:04 -0400
From: Taylor Blau
To: git@vger.kernel.org
Cc: Elijah Newren, "Eric W. Biederman", Jeff King, Junio C Hamano, Patrick Steinhardt
Subject: [PATCH v5 4/5] bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
Message-ID: <2670192802a904b42fb0c11c26c9f7311aa8dd90.1698101088.git.me@ttaylorr.com>

The remaining missing piece in order to teach the `merge-tree` builtin
how to write the contents of a merge into a pack is a function to index
tree objects into a bulk-checkin pack.

This patch implements that missing piece, which is a thin wrapper
around all of the functionality introduced in previous commits.

If and when Git gains support for a "compatibility" hash algorithm, the
changes to support that here will be minimal. The bulk-checkin
machinery will need to convert the incoming tree to compute its length
under the compatibility hash, necessary to reconstruct its header. With
that information (and the converted contents of the tree), the
bulk-checkin machinery will have enough to keep track of the converted
object's hash in order to update the compatibility mapping.

Within some thin wrapper around `deflate_obj_to_pack_incore()` (perhaps
`deflate_tree_to_pack_incore()`), the changes should be limited to
something like:

    struct strbuf converted = STRBUF_INIT;
    if (the_repository->compat_hash_algo) {
            if (convert_object_file(&compat_obj,
                                    the_repository->hash_algo,
                                    the_repository->compat_hash_algo, ...) < 0)
                    die(...);

            format_object_header_hash(the_repository->compat_hash_algo,
                                      OBJ_TREE, size);
    }
    /* compute the converted tree's hash using the compat algorithm */
    strbuf_release(&converted);

, assuming related changes throughout the rest of the bulk-checkin
machinery necessary to update the hash of the converted object, which
are likewise minimal in size.

Signed-off-by: Taylor Blau
---
 bulk-checkin.c | 12 ++++++++++++
 bulk-checkin.h |  4 ++++
 2 files changed, 16 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index b728210bc7..bd6151ba3c 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -496,6 +496,18 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 	return status;
 }
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags)
+{
+	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
+						buf, size, path, OBJ_TREE,
+						flags);
+	if (!odb_transaction_nesting)
+		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
+	return status;
+}
+
 void begin_odb_transaction(void)
 {
 	odb_transaction_nesting += 1;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 1b91daeaee..89786b3954 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -17,6 +17,10 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 				   const void *buf, size_t size,
 				   const char *path, unsigned flags);
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags);
+
 /*
  * Tell the object database to optimize for adding
  * multiple objects. end_odb_transaction must be called
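
Before the next patch wires this into merge-ort, a sketch of the
intended call pattern may help; `tree_buf` and the helper that fills it
are placeholders invented here, standing in for merge-ort's own tree
assembly:

	struct object_id tree_oid;
	struct strbuf tree_buf = STRBUF_INIT;

	/*
	 * Placeholder: fill tree_buf with canonical
	 * "<mode> <name>\0<binary oid>" entries, as merge-ort does when
	 * assembling a result tree.
	 */
	fill_tree_entries(&tree_buf);

	if (index_tree_bulk_checkin_incore(&tree_oid, tree_buf.buf,
					   tree_buf.len, "", HASH_WRITE_OBJECT))
		die("unable to add tree to the bulk-checkin pack");

	strbuf_release(&tree_buf);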
From patchwork Mon Oct 23 22:45:06 2023
X-Patchwork-Submitter: Taylor Blau
X-Patchwork-Id: 13433673
Date: Mon, 23 Oct 2023 18:45:06 -0400
From: Taylor Blau
To: git@vger.kernel.org
Cc: Elijah Newren, "Eric W. Biederman", Jeff King, Junio C Hamano, Patrick Steinhardt
Subject: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
Message-ID: <3595db76a525fcebc3c896e231246704b044310c.1698101088.git.me@ttaylorr.com>

When using merge-tree often within a repository[^1], it is possible to
generate a relatively large number of loose objects, which can result
in degraded performance and, in extreme cases, inode exhaustion.

Building on the functionality introduced in previous commits, the
bulk-checkin machinery now has support to write arbitrary blob and tree
objects which are small enough to be held in-core. We can use this to
write any blob/tree objects generated by ORT into a separate pack
instead of writing them out individually as loose.

This functionality is gated behind a new `--write-pack` option to
`merge-tree` that works with the (non-deprecated) `--write-tree` mode.

The implementation is relatively straightforward. There are two spots
within the ORT mechanism where we call `write_object_file()`: one for
content differences within blobs, and another to assemble any new trees
necessary to construct the merge. In each of those locations,
conditionally replace calls to `write_object_file()` with
`index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
depending on which kind of object we are writing.

The only remaining task is to begin and end the transaction necessary
to initialize the bulk-checkin machinery, and move any new pack(s) it
created into the main object store.

[^1]: Such is the case at GitHub, where we run presumptive "test
  merges" on open pull requests to see whether we can light up the
  merge button green, depending on whether the presumptive merge was
  conflicted. This is done in response to a number of user-initiated
  events, including viewing an open pull request whose last test merge
  is stale with respect to the current base and tip of the pull
  request. As a result, merge-tree can be run very frequently on large,
  active repositories.

Signed-off-by: Taylor Blau
---
 Documentation/git-merge-tree.txt |  4 ++
 builtin/merge-tree.c             |  5 ++
 merge-ort.c                      | 42 +++++++++++----
 merge-recursive.h                |  1 +
 t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
 5 files changed, 136 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
index ffc4fbf7e8..9d37609ef1 100644
--- a/Documentation/git-merge-tree.txt
+++ b/Documentation/git-merge-tree.txt
@@ -69,6 +69,10 @@ OPTIONS
 	specify a merge-base for the merge, and specifying multiple bases is
 	currently not supported. This option is incompatible with `--stdin`.
 
+--write-pack::
+	Write any new objects into a separate packfile instead of as
+	individual loose objects.
+ [[OUTPUT]] OUTPUT ------ diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c index a35e0452d6..218442ac9b 100644 --- a/builtin/merge-tree.c +++ b/builtin/merge-tree.c @@ -19,6 +19,7 @@ #include "tree.h" #include "config.h" #include "strvec.h" +#include "bulk-checkin.h" static int line_termination = '\n'; @@ -416,6 +417,7 @@ struct merge_tree_options { int name_only; int use_stdin; struct merge_options merge_options; + int write_pack; }; static int real_merge(struct merge_tree_options *o, @@ -441,6 +443,7 @@ static int real_merge(struct merge_tree_options *o, _("not something we can merge")); opt.show_rename_progress = 0; + opt.write_pack = o->write_pack; opt.branch1 = branch1; opt.branch2 = branch2; @@ -553,6 +556,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix) N_("specify a merge-base for the merge")), OPT_STRVEC('X', "strategy-option", &xopts, N_("option=value"), N_("option for selected merge strategy")), + OPT_BOOL(0, "write-pack", &o.write_pack, + N_("write new objects to a pack instead of as loose")), OPT_END() }; diff --git a/merge-ort.c b/merge-ort.c index 3653725661..523577d71e 100644 --- a/merge-ort.c +++ b/merge-ort.c @@ -48,6 +48,7 @@ #include "tree.h" #include "unpack-trees.h" #include "xdiff-interface.h" +#include "bulk-checkin.h" /* * We have many arrays of size 3. Whenever we have such an array, the @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt, if ((merge_status < 0) || !result_buf.ptr) ret = error(_("failed to execute internal merge")); - if (!ret && - write_object_file(result_buf.ptr, result_buf.size, - OBJ_BLOB, &result->oid)) - ret = error(_("unable to add %s to database"), path); + if (!ret) { + ret = opt->write_pack + ? index_blob_bulk_checkin_incore(&result->oid, + result_buf.ptr, + result_buf.size, + path, 1) + : write_object_file(result_buf.ptr, + result_buf.size, + OBJ_BLOB, &result->oid); + if (ret) + ret = error(_("unable to add %s to database"), + path); + } free(result_buf.ptr); if (ret) @@ -3597,7 +3607,8 @@ static int tree_entry_order(const void *a_, const void *b_) b->string, strlen(b->string), bmi->result.mode); } -static int write_tree(struct object_id *result_oid, +static int write_tree(struct merge_options *opt, + struct object_id *result_oid, struct string_list *versions, unsigned int offset, size_t hash_size) @@ -3631,8 +3642,14 @@ static int write_tree(struct object_id *result_oid, } /* Write this object file out, and record in result_oid */ - if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid)) + ret = opt->write_pack + ? 
index_tree_bulk_checkin_incore(result_oid, + buf.buf, buf.len, "", 1) + : write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid); + + if (ret) ret = -1; + strbuf_release(&buf); return ret; } @@ -3797,8 +3814,8 @@ static int write_completed_directory(struct merge_options *opt, */ dir_info->is_null = 0; dir_info->result.mode = S_IFDIR; - if (write_tree(&dir_info->result.oid, &info->versions, offset, - opt->repo->hash_algo->rawsz) < 0) + if (write_tree(opt, &dir_info->result.oid, &info->versions, + offset, opt->repo->hash_algo->rawsz) < 0) ret = -1; } @@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt, fflush(stdout); BUG("dir_metadata accounting completely off; shouldn't happen"); } - if (write_tree(result_oid, &dir_metadata.versions, 0, + if (write_tree(opt, result_oid, &dir_metadata.versions, 0, opt->repo->hash_algo->rawsz) < 0) ret = -1; + + if (opt->write_pack) + end_odb_transaction(); + cleanup: string_list_clear(&plist, 0); string_list_clear(&dir_metadata.versions, 0); @@ -4878,6 +4899,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result) */ strmap_init(&opt->priv->conflicts); + if (opt->write_pack) + begin_odb_transaction(); + trace2_region_leave("merge", "allocate/init", opt->repo); } diff --git a/merge-recursive.h b/merge-recursive.h index 3d3b3e3c29..5c5ff380a8 100644 --- a/merge-recursive.h +++ b/merge-recursive.h @@ -48,6 +48,7 @@ struct merge_options { unsigned renormalize : 1; unsigned record_conflict_msgs_as_headers : 1; const char *msg_header_prefix; + unsigned write_pack : 1; /* internal fields used by the implementation */ struct merge_options_internal *priv; diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh index b2c8a43fce..d2a8634523 100755 --- a/t/t4301-merge-tree-write-tree.sh +++ b/t/t4301-merge-tree-write-tree.sh @@ -945,4 +945,97 @@ test_expect_success 'check the input format when --stdin is passed' ' test_cmp expect actual ' +packdir=".git/objects/pack" + +test_expect_success 'merge-tree can pack its result with --write-pack' ' + test_when_finished "rm -rf repo" && + git init repo && + + # base has lines [3, 4, 5] + # - side adds to the beginning, resulting in [1, 2, 3, 4, 5] + # - other adds to the end, resulting in [3, 4, 5, 6, 7] + # + # merging the two should result in a new blob object containing + # [1, 2, 3, 4, 5, 6, 7], along with a new tree. + test_commit -C repo base file "$(test_seq 3 5)" && + git -C repo branch -M main && + git -C repo checkout -b side main && + test_commit -C repo side file "$(test_seq 1 5)" && + git -C repo checkout -b other main && + test_commit -C repo other file "$(test_seq 3 7)" && + + find repo/$packdir -type f -name "pack-*.idx" >packs.before && + tree="$(git -C repo merge-tree --write-pack \ + refs/tags/side refs/tags/other)" && + blob="$(git -C repo rev-parse $tree:file)" && + find repo/$packdir -type f -name "pack-*.idx" >packs.after && + + test_must_be_empty packs.before && + test_line_count = 1 packs.after && + + git show-index <$(cat packs.after) >objects && + test_line_count = 2 objects && + grep "^[1-9][0-9]* $tree" objects && + grep "^[1-9][0-9]* $blob" objects +' + +test_expect_success 'merge-tree can write multiple packs with --write-pack' ' + test_when_finished "rm -rf repo" && + git init repo && + ( + cd repo && + + git config pack.packSizeLimit 512 && + + test_seq 512 >f && + + # "f" contains roughly ~2,000 bytes. 
+ # + # Each side ("foo" and "bar") adds a small amount of data at the + # beginning and end of "base", respectively. + git add f && + test_tick && + git commit -m base && + git branch -M main && + + git checkout -b foo main && + { + echo foo && cat f + } >f.tmp && + mv f.tmp f && + git add f && + test_tick && + git commit -m foo && + + git checkout -b bar main && + echo bar >>f && + git add f && + test_tick && + git commit -m bar && + + find $packdir -type f -name "pack-*.idx" >packs.before && + # Merging either side should result in a new object which is + # larger than 1M, thus the result should be split into two + # separate packs. + tree="$(git merge-tree --write-pack \ + refs/heads/foo refs/heads/bar)" && + blob="$(git rev-parse $tree:f)" && + find $packdir -type f -name "pack-*.idx" >packs.after && + + test_must_be_empty packs.before && + test_line_count = 2 packs.after && + for idx in $(cat packs.after) + do + git show-index <$idx || return 1 + done >objects && + + # The resulting set of packs should contain one copy of both + # objects, each in a separate pack. + test_line_count = 2 objects && + grep "^[1-9][0-9]* $tree" objects && + grep "^[1-9][0-9]* $blob" objects + + ) +' + test_done
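
Stepping back, the shape of the merge-ort change above can be
summarized by the following sketch of a dispatch helper; the patch
open-codes this as a ternary at each of the two call sites, and
`write_blob_result()` is a name invented here purely for illustration:

	static int write_blob_result(struct merge_options *opt,
				     struct object_id *oid, const char *path,
				     const char *ptr, size_t len)
	{
		/* route through bulk-checkin only when --write-pack is given */
		if (opt->write_pack)
			return index_blob_bulk_checkin_incore(oid, ptr, len,
							      path, 1);
		return write_object_file(ptr, len, OBJ_BLOB, oid);
	}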