diff mbox series

[v4,4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`

Message ID e427fe6ad383cc238c13f313dc9f11eab37a3840.1697736516.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series merge-ort: implement support for packing objects together | expand

Commit Message

Taylor Blau Oct. 19, 2023, 5:28 p.m. UTC
Continue to prepare for streaming an object's contents directly from
memory by teaching `bulk_checkin_source` how to perform reads and seeks
based on an address in memory.

Unlike file descriptors, which manage their own offset internally, we
have to keep track of how many bytes we've read out of the buffer, and
make sure we don't read past the end of the buffer.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Comments

Patrick Steinhardt Oct. 23, 2023, 9:19 a.m. UTC | #1
On Thu, Oct 19, 2023 at 01:28:51PM -0400, Taylor Blau wrote:
> Continue to prepare for streaming an object's contents directly from
> memory by teaching `bulk_checkin_source` how to perform reads and seeks
> based on an address in memory.
> 
> Unlike file descriptors, which manage their own offset internally, we
> have to keep track of how many bytes we've read out of the buffer, and
> make sure we don't read past the end of the buffer.
> 
> Suggested-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  bulk-checkin.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 28bc8d5ab4..60361b3e2e 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -141,11 +141,15 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
>  }
>  
>  struct bulk_checkin_source {
> -	enum { SOURCE_FILE } type;
> +	enum { SOURCE_FILE, SOURCE_INCORE } type;
>  
>  	/* SOURCE_FILE fields */
>  	int fd;
>  
> +	/* SOURCE_INCORE fields */
> +	const void *buf;
> +	size_t read;
> +
>  	/* common fields */
>  	size_t size;
>  	const char *path;
> @@ -157,6 +161,11 @@ static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
>  	switch (source->type) {
>  	case SOURCE_FILE:
>  		return lseek(source->fd, offset, SEEK_SET);
> +	case SOURCE_INCORE:
> +		if (!(0 <= offset && offset < source->size))
> +			return (off_t)-1;
> +		source->read = offset;
> +		return source->read;
>  	default:
>  		BUG("unknown bulk-checkin source: %d", source->type);
>  	}
> @@ -168,6 +177,13 @@ static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
>  	switch (source->type) {
>  	case SOURCE_FILE:
>  		return read_in_full(source->fd, buf, nr);
> +	case SOURCE_INCORE:
> +		assert(source->read <= source->size);

Is there any guideline around when to use `assert()` vs `BUG()`? I think
that this assertion here is quite critical, because when it does not
hold we can end up performing out-of-bounds reads and writes. But as
asserts are typically missing in non-debug builds, this safeguard would
not do anything for our end users, right?

Patrick

> +		if (nr > source->size - source->read)
> +			nr = source->size - source->read;
> +		memcpy(buf, (unsigned char *)source->buf + source->read, nr);
> +		source->read += nr;
> +		return nr;
>  	default:
>  		BUG("unknown bulk-checkin source: %d", source->type);
>  	}
> -- 
> 2.42.0.405.g86fe3250c2
>
Jeff King Oct. 23, 2023, 6:58 p.m. UTC | #2
On Mon, Oct 23, 2023 at 11:19:13AM +0200, Patrick Steinhardt wrote:

> > +	case SOURCE_INCORE:
> > +		assert(source->read <= source->size);
> 
> Is there any guideline around when to use `assert()` vs `BUG()`? I think
> that this assertion here is quite critical, because when it does not
> hold we can end up performing out-of-bounds reads and writes. But as
> asserts are typically missing in non-debug builds, this safeguard would
> not do anything for our end users, right?

I don't think we have a written guideline. My philosophy is: always use
BUG(), because you will never be surprised that the assertion was not
compiled in (and I think compiling without assertions is almost
certainly premature optimization, especially given the way we tend to
use them).

-Peff
Patrick Steinhardt Oct. 24, 2023, 6:34 a.m. UTC | #3
On Mon, Oct 23, 2023 at 02:58:42PM -0400, Jeff King wrote:
> On Mon, Oct 23, 2023 at 11:19:13AM +0200, Patrick Steinhardt wrote:
> 
> > > +	case SOURCE_INCORE:
> > > +		assert(source->read <= source->size);
> > 
> > Is there any guideline around when to use `assert()` vs `BUG()`? I think
> > that this assertion here is quite critical, because when it does not
> > hold we can end up performing out-of-bounds reads and writes. But as
> > asserts are typically missing in non-debug builds, this safeguard would
> > not do anything for our end users, right?
> 
> I don't think we have a written guideline. My philosophy is: always use
> BUG(), because you will never be surprised that the assertion was not
> compiled in (and I think compiling without assertions is almost
> certainly premature optimization, especially given the way we tend to
> use them).
> 
> -Peff

I'm inclined to agree with your philosophy. Makes me wonder whether we
should write a Coccinelle rule to catch this. But a quick-and-dirty grep
in our codebase shows that such a rule would cause quite a lot of churn:

$ git grep BUG\( | wc -l
677
$ git grep assert\( | wc -l
549

Probably not worth it.

Patrick
Junio C Hamano Oct. 24, 2023, 5:08 p.m. UTC | #4
Patrick Steinhardt <ps@pks.im> writes:

> I'm inclined to agree with your philosophy. Makes me wonder whether we
> should write a Coccinelle rule to catch this. But a quick-and-dirty grep
> in our codebase shows that such a rule would cause quite a lot of churn:
>
> $ git grep BUG\( | wc -l
> 677
> $ git grep assert\( | wc -l
> 549
>
> Probably not worth it.

Yeah, we can stick to our usual "do not add new instances, fix them
while touching near-by code" pattern for this one, I would say.

Thanks.
diff mbox series

Patch

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 28bc8d5ab4..60361b3e2e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -141,11 +141,15 @@  static int already_written(struct bulk_checkin_packfile *state, struct object_id
 }
 
 struct bulk_checkin_source {
-	enum { SOURCE_FILE } type;
+	enum { SOURCE_FILE, SOURCE_INCORE } type;
 
 	/* SOURCE_FILE fields */
 	int fd;
 
+	/* SOURCE_INCORE fields */
+	const void *buf;
+	size_t read;
+
 	/* common fields */
 	size_t size;
 	const char *path;
@@ -157,6 +161,11 @@  static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
 	switch (source->type) {
 	case SOURCE_FILE:
 		return lseek(source->fd, offset, SEEK_SET);
+	case SOURCE_INCORE:
+		if (!(0 <= offset && offset < source->size))
+			return (off_t)-1;
+		source->read = offset;
+		return source->read;
 	default:
 		BUG("unknown bulk-checkin source: %d", source->type);
 	}
@@ -168,6 +177,13 @@  static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
 	switch (source->type) {
 	case SOURCE_FILE:
 		return read_in_full(source->fd, buf, nr);
+	case SOURCE_INCORE:
+		assert(source->read <= source->size);
+		if (nr > source->size - source->read)
+			nr = source->size - source->read;
+		memcpy(buf, (unsigned char *)source->buf + source->read, nr);
+		source->read += nr;
+		return nr;
 	default:
 		BUG("unknown bulk-checkin source: %d", source->type);
 	}