diff mbox series

[v3,22/30] pseudo-merge: implement support for reading pseudo-merge commits

Message ID 8ba0a9c5402fb154bc316768a8fbb016e302a686.1716318089.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series pack-bitmap: pseudo-merge reachability bitmaps | expand

Commit Message

Taylor Blau May 21, 2024, 7:02 p.m. UTC
Implement the basic API for reading pseudo-merge bitmaps, which consists
of four basic functions:

  - pseudo_merge_bitmap()
  - use_pseudo_merge()
  - apply_pseudo_merges_for_commit()
  - cascade_pseudo_merges()

These functions are all documented in pseudo-merge.h, but their rough
descriptions are as follows:

  - pseudo_merge_bitmap() reads and inflates the objects EWAH bitmap for
    a given pseudo-merge

  - use_pseudo_merge() does the same as pseudo_merge_bitmap(), but on
    the commits EWAH bitmap, not the objects bitmap

  - apply_pseudo_merges_for_commit() applies all satisfied pseudo-merge
    commits for a given result set, and cascades any yet-unsatisfied
    pseudo-merges if any were applied in the previous step

  - cascade_pseudo_merges() applies all pseudo-merges which are
    satisfied but have not been previously applied, repeating this
    process until no more pseudo-merges can be applied

The core of the API is the latter two functions, which are responsible
for applying pseudo-merges during the object traversal implemented in
the pack-bitmap machinery.

The other two functions (pseudo_merge_bitmap(), and use_pseudo_merge())
are low-level ways to interact with the pseudo-merge machinery, which
will be useful in future commits.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pseudo-merge.c | 231 +++++++++++++++++++++++++++++++++++++++++++++++++
 pseudo-merge.h |  44 ++++++++++
 2 files changed, 275 insertions(+)

Comments

Jeff King May 23, 2024, 10:40 a.m. UTC | #1
On Tue, May 21, 2024 at 03:02:48PM -0400, Taylor Blau wrote:

> These functions are all documented in pseudo-merge.h, but their rough
> descriptions are as follows:
> 
>   - pseudo_merge_bitmap() reads and inflates the objects EWAH bitmap for
>     a given pseudo-merge
> 
>   - use_pseudo_merge() does the same as pseudo_merge_bitmap(), but on
>     the commits EWAH bitmap, not the objects bitmap
> 
>   - apply_pseudo_merges_for_commit() applies all satisfied pseudo-merge
>     commits for a given result set, and cascades any yet-unsatisfied
>     pseudo-merges if any were applied in the previous step
> 
>   - cascade_pseudo_merges() applies all pseudo-merges which are
>     satisfied but have not been previously applied, repeating this
>     process until no more pseudo-merges can be applied

OK, so I think this commit is getting into the meat of how the new
bitmaps will be used. Just to restate it from a high-level to make sure
I understand, I think it is:

  1. When we are traversing (or even before we traverse and just know
     our tips), we can always say "hey, I have a commit in the bitmap;
     does this satisfy any pseudo-merges?". Where "satisfy" is "all of
     the commits pseudo-merged for that bitmap are already in our
     result". And if so, then we can use the pseudo-merge bitmap by
     OR-ing it in.

     And that's apply_pseudo_merges_for_commit().

  2. That "OR" operation may likewise open up new options, so we
     recurse. And that's the "cascade" function.

This commit is not yet adding the calls into this code for part (1). I
think there's an open question there of overhead; i.e., how expensive it
is to check whether each pseudo-merge is satisfied. And whether it makes
sense to just do it once at the beginning of the traversal (with the
provided tips) or to keep checking as we traverse (more expensive, but
it makes it more likely to use an older pseudo-merge bitmap that's had
new history built on top of some of its commits).

But those calls should come later, so let's read on.

> +static int pseudo_merge_ext_at(const struct pseudo_merge_map *pm,
> +			       struct pseudo_merge_commit_ext *ext, size_t at)
> +{
> +	if (at >= pm->map_size)
> +		return error(_("extended pseudo-merge read out-of-bounds "
> +			       "(%"PRIuMAX" >= %"PRIuMAX")"),
> +			     (uintmax_t)at, (uintmax_t)pm->map_size);
> +
> +	ext->nr = get_be32(pm->map + at);
> +	ext->ptr = pm->map + at + sizeof(uint32_t);
> +
> +	return 0;
> +}

I was happy to see the boundary check here. Do we need a length check,
too? We'd need at least four bytes here for the uint32_t. Does map_size
include the trailing hash? If not, then it might provide a bit of slop
(we'd read garbage, but never go outside the mmap).

I guess the ">=" in the size check implies that we have at least one
byte, but I don't think anything promises that we're correctly 4-byte
aligned.

The rest of the length check is here:

> +struct ewah_bitmap *pseudo_merge_bitmap(const struct pseudo_merge_map *pm,
> +					struct pseudo_merge *merge)
> +{
> +	if (!merge->loaded_commits)
> +		BUG("cannot use unloaded pseudo-merge bitmap");
> +
> +	if (!merge->loaded_bitmap) {
> +		size_t at = merge->bitmap_at;
> +
> +		merge->bitmap = read_bitmap(pm->map, pm->map_size, &at);
> +		merge->loaded_bitmap = 1;
> +	}
> +
> +	return merge->bitmap;
> +}

When we call read_bitmap(), it knows where the end is, and it's
careful to avoid reading past it. Good.

-Peff
Taylor Blau May 23, 2024, 6:09 p.m. UTC | #2
On Thu, May 23, 2024 at 06:40:00AM -0400, Jeff King wrote:
> OK, so I think this commit is getting into the meat of how the new
> bitmaps will be used. Just to restate it from a high-level to make sure
> I understand, I think it is:
>
>   1. When we are traversing (or even before we traverse and just know
>      our tips), we can always say "hey, I have a commit in the bitmap;
>      does this satisfy any pseudo-merges?". Where "satisfy" is "all of
>      the commits pseudo-merged for that bitmap are already in our
>      result". And if so, then we can use the pseudo-merge bitmap by
>      OR-ing it in.
>
>      And that's apply_pseudo_merges_for_commit().
>
>   2. That "OR" operation may likewise open up new options, so we
>      recurse. And that's the "cascade" function.

Exactly. I think implicit in the above is that your (2) is also a
recursive step, since each cascade step may open us up to new
pseudo-merges, which themselves may reach objects which satisfy other
pseudo-merges, and so on.

> > +static int pseudo_merge_ext_at(const struct pseudo_merge_map *pm,
> > +			       struct pseudo_merge_commit_ext *ext, size_t at)
> > +{
> > +	if (at >= pm->map_size)
> > +		return error(_("extended pseudo-merge read out-of-bounds "
> > +			       "(%"PRIuMAX" >= %"PRIuMAX")"),
> > +			     (uintmax_t)at, (uintmax_t)pm->map_size);
> > +
> > +	ext->nr = get_be32(pm->map + at);
> > +	ext->ptr = pm->map + at + sizeof(uint32_t);
> > +
> > +	return 0;
> > +}
>
> I was happy to see the boundary check here. Do we need a length check,
> too? We'd need at least four bytes here for the uint32_t. Does map_size
> include the trailing hash? If not, then it might provide a bit of slop
> (we'd read garbage, but never go outside the mmap).
>
> I guess the ">=" in the size check implies that we have at least one
> byte, but I don't think anything promises that we're correctly 4-byte
> aligned.

Yeah, we could read into the trailing hash area, which would just be
garbage from our perspective. But I think that adding a length check is
easy enough to do, something like:

--- 8< ---
diff --git a/pseudo-merge.c b/pseudo-merge.c
index b539791396..7d13101149 100644
--- a/pseudo-merge.c
+++ b/pseudo-merge.c
@@ -478,6 +478,10 @@ static int pseudo_merge_ext_at(const struct pseudo_merge_map *pm,
 		return error(_("extended pseudo-merge read out-of-bounds "
 			       "(%"PRIuMAX" >= %"PRIuMAX")"),
 			     (uintmax_t)at, (uintmax_t)pm->map_size);
+	if (at + 4 >= pm->map_size)
+		return error(_("extended pseudo-merge entry is too short "
+			       "(%"PRIuMAX" >= %"PRIuMAX")"),
+			     (uintmax_t)(at + 4), (uintmax_t)pm->map_size);

 	ext->nr = get_be32(pm->map + at);
 	ext->ptr = pm->map + at + sizeof(uint32_t);
--- >8 ---

> The rest of the length check is here:
>
> > +struct ewah_bitmap *pseudo_merge_bitmap(const struct pseudo_merge_map *pm,
> > +					struct pseudo_merge *merge)
> > +{
> > +	if (!merge->loaded_commits)
> > +		BUG("cannot use unloaded pseudo-merge bitmap");
> > +
> > +	if (!merge->loaded_bitmap) {
> > +		size_t at = merge->bitmap_at;
> > +
> > +		merge->bitmap = read_bitmap(pm->map, pm->map_size, &at);
> > +		merge->loaded_bitmap = 1;
> > +	}
> > +
> > +	return merge->bitmap;
> > +}
>
> When we call read_bitmap(), it knows where the end is, and it's
> careful to avoid reading past it. Good.

Yep, thanks for double checking.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/pseudo-merge.c b/pseudo-merge.c
index 1aca70ecdfb..0f50ac6183e 100644
--- a/pseudo-merge.c
+++ b/pseudo-merge.c
@@ -10,6 +10,7 @@ 
 #include "commit.h"
 #include "alloc.h"
 #include "progress.h"
+#include "hex.h"
 
 #define DEFAULT_PSEUDO_MERGE_DECAY 1.0f
 #define DEFAULT_PSEUDO_MERGE_MAX_MERGES 64
@@ -464,3 +465,233 @@  void free_pseudo_merge_map(struct pseudo_merge_map *pm)
 	}
 	free(pm->v);
 }
+
+struct pseudo_merge_commit_ext {
+	uint32_t nr;
+	const unsigned char *ptr;
+};
+
+static int pseudo_merge_ext_at(const struct pseudo_merge_map *pm,
+			       struct pseudo_merge_commit_ext *ext, size_t at)
+{
+	if (at >= pm->map_size)
+		return error(_("extended pseudo-merge read out-of-bounds "
+			       "(%"PRIuMAX" >= %"PRIuMAX")"),
+			     (uintmax_t)at, (uintmax_t)pm->map_size);
+
+	ext->nr = get_be32(pm->map + at);
+	ext->ptr = pm->map + at + sizeof(uint32_t);
+
+	return 0;
+}
+
+struct ewah_bitmap *pseudo_merge_bitmap(const struct pseudo_merge_map *pm,
+					struct pseudo_merge *merge)
+{
+	if (!merge->loaded_commits)
+		BUG("cannot use unloaded pseudo-merge bitmap");
+
+	if (!merge->loaded_bitmap) {
+		size_t at = merge->bitmap_at;
+
+		merge->bitmap = read_bitmap(pm->map, pm->map_size, &at);
+		merge->loaded_bitmap = 1;
+	}
+
+	return merge->bitmap;
+}
+
+struct pseudo_merge *use_pseudo_merge(const struct pseudo_merge_map *pm,
+				      struct pseudo_merge *merge)
+{
+	if (!merge->loaded_commits) {
+		size_t pos = merge->at;
+
+		merge->commits = read_bitmap(pm->map, pm->map_size, &pos);
+		merge->bitmap_at = pos;
+		merge->loaded_commits = 1;
+	}
+	return merge;
+}
+
+static struct pseudo_merge *pseudo_merge_at(const struct pseudo_merge_map *pm,
+					    struct object_id *oid,
+					    size_t want)
+{
+	size_t lo = 0;
+	size_t hi = pm->nr;
+
+	while (lo < hi) {
+		size_t mi = lo + (hi - lo) / 2;
+		size_t got = pm->v[mi].at;
+
+		if (got == want)
+			return use_pseudo_merge(pm, &pm->v[mi]);
+		else if (got < want)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
+
+	warning(_("could not find pseudo-merge for commit %s at offset %"PRIuMAX),
+		oid_to_hex(oid), (uintmax_t)want);
+
+	return NULL;
+}
+
+struct pseudo_merge_commit {
+	uint32_t commit_pos;
+	uint64_t pseudo_merge_ofs;
+};
+
+#define PSEUDO_MERGE_COMMIT_RAWSZ (sizeof(uint32_t)+sizeof(uint64_t))
+
+static void read_pseudo_merge_commit_at(struct pseudo_merge_commit *merge,
+					const unsigned char *at)
+{
+	merge->commit_pos = get_be32(at);
+	merge->pseudo_merge_ofs = get_be64(at + sizeof(uint32_t));
+}
+
+static int nth_pseudo_merge_ext(const struct pseudo_merge_map *pm,
+				struct pseudo_merge_commit_ext *ext,
+				struct pseudo_merge_commit *merge,
+				uint32_t n)
+{
+	size_t ofs;
+
+	if (n >= ext->nr)
+		return error(_("extended pseudo-merge lookup out-of-bounds "
+			       "(%"PRIu32" >= %"PRIu32")"), n, ext->nr);
+
+	ofs = get_be64(ext->ptr + st_mult(n, sizeof(uint64_t)));
+	if (ofs >= pm->map_size)
+		return error(_("out-of-bounds read: (%"PRIuMAX" >= %"PRIuMAX")"),
+			     (uintmax_t)ofs, (uintmax_t)pm->map_size);
+
+	read_pseudo_merge_commit_at(merge, pm->map + ofs);
+
+	return 0;
+}
+
+static unsigned apply_pseudo_merge(const struct pseudo_merge_map *pm,
+				   struct pseudo_merge *merge,
+				   struct bitmap *result,
+				   struct bitmap *roots)
+{
+	if (merge->satisfied)
+		return 0;
+
+	if (!ewah_bitmap_is_subset(merge->commits, roots ? roots : result))
+		return 0;
+
+	bitmap_or_ewah(result, pseudo_merge_bitmap(pm, merge));
+	if (roots)
+		bitmap_or_ewah(roots, pseudo_merge_bitmap(pm, merge));
+	merge->satisfied = 1;
+
+	return 1;
+}
+
+static int pseudo_merge_commit_cmp(const void *va, const void *vb)
+{
+	struct pseudo_merge_commit merge;
+	uint32_t key = *(uint32_t*)va;
+
+	read_pseudo_merge_commit_at(&merge, vb);
+
+	if (key < merge.commit_pos)
+		return -1;
+	if (key > merge.commit_pos)
+		return 1;
+	return 0;
+}
+
+static struct pseudo_merge_commit *find_pseudo_merge(const struct pseudo_merge_map *pm,
+						     uint32_t pos)
+{
+	if (!pm->commits_nr)
+		return NULL;
+
+	return bsearch(&pos, pm->commits, pm->commits_nr,
+		       PSEUDO_MERGE_COMMIT_RAWSZ, pseudo_merge_commit_cmp);
+}
+
+int apply_pseudo_merges_for_commit(const struct pseudo_merge_map *pm,
+				   struct bitmap *result,
+				   struct commit *commit, uint32_t commit_pos)
+{
+	struct pseudo_merge *merge;
+	struct pseudo_merge_commit *merge_commit;
+	int ret = 0;
+
+	merge_commit = find_pseudo_merge(pm, commit_pos);
+	if (!merge_commit)
+		return 0;
+
+	if (merge_commit->pseudo_merge_ofs & ((uint64_t)1<<63)) {
+		struct pseudo_merge_commit_ext ext = { 0 };
+		off_t ofs = merge_commit->pseudo_merge_ofs & ~((uint64_t)1<<63);
+		uint32_t i;
+
+		if (pseudo_merge_ext_at(pm, &ext, ofs) < -1) {
+			warning(_("could not read extended pseudo-merge table "
+				  "for commit %s"),
+				oid_to_hex(&commit->object.oid));
+			return ret;
+		}
+
+		for (i = 0; i < ext.nr; i++) {
+			if (nth_pseudo_merge_ext(pm, &ext, merge_commit, i) < 0)
+				return ret;
+
+			merge = pseudo_merge_at(pm, &commit->object.oid,
+						merge_commit->pseudo_merge_ofs);
+
+			if (!merge)
+				return ret;
+
+			if (apply_pseudo_merge(pm, merge, result, NULL))
+				ret++;
+		}
+	} else {
+		merge = pseudo_merge_at(pm, &commit->object.oid,
+					merge_commit->pseudo_merge_ofs);
+
+		if (!merge)
+			return ret;
+
+		if (apply_pseudo_merge(pm, merge, result, NULL))
+			ret++;
+	}
+
+	if (ret)
+		cascade_pseudo_merges(pm, result, NULL);
+
+	return ret;
+}
+
+int cascade_pseudo_merges(const struct pseudo_merge_map *pm,
+			  struct bitmap *result,
+			  struct bitmap *roots)
+{
+	unsigned any_satisfied;
+	int ret = 0;
+
+	do {
+		struct pseudo_merge *merge;
+		uint32_t i;
+
+		any_satisfied = 0;
+
+		for (i = 0; i < pm->nr; i++) {
+			merge = use_pseudo_merge(pm, &pm->v[i]);
+			if (apply_pseudo_merge(pm, merge, result, roots)) {
+				any_satisfied |= 1;
+				ret++;
+			}
+		}
+	} while (any_satisfied);
+
+	return ret;
+}
diff --git a/pseudo-merge.h b/pseudo-merge.h
index a3f0243062c..c00b622be4b 100644
--- a/pseudo-merge.h
+++ b/pseudo-merge.h
@@ -162,4 +162,48 @@  struct pseudo_merge {
  */
 void free_pseudo_merge_map(struct pseudo_merge_map *pm);
 
+/*
+ * Loads the bitmap corresponding to the given pseudo-merge from the
+ * map, if it has not already been loaded.
+ */
+struct ewah_bitmap *pseudo_merge_bitmap(const struct pseudo_merge_map *pm,
+					struct pseudo_merge *merge);
+
+/*
+ * Loads the pseudo-merge and its commits bitmap from the given
+ * pseudo-merge map, if it has not already been loaded.
+ */
+struct pseudo_merge *use_pseudo_merge(const struct pseudo_merge_map *pm,
+				      struct pseudo_merge *merge);
+
+/*
+ * Applies pseudo-merge(s) containing the given commit to the bitmap
+ * "result".
+ *
+ * If any pseudo-merge(s) were satisfied, returns the number
+ * satisfied, otherwise returns 0. If any were satisfied, the
+ * remaining unsatisfied pseudo-merges are cascaded (see below).
+ */
+int apply_pseudo_merges_for_commit(const struct pseudo_merge_map *pm,
+				   struct bitmap *result,
+				   struct commit *commit, uint32_t commit_pos);
+
+/*
+ * Applies pseudo-merge(s) which are satisfied according to the
+ * current bitmap in result (or roots, see below). If any
+ * pseudo-merges were satisfied, repeat the process over unsatisfied
+ * pseudo-merge commits until no more pseudo-merges are satisfied.
+ *
+ * Result is the bitmap to which the pseudo-merge(s) are applied.
+ * Roots (if given) is a bitmap of the traversal tip(s) for either
+ * side of a reachability traversal.
+ *
+ * Roots may given instead of a populated results bitmap at the
+ * beginning of a traversal on either side where the reachability
+ * closure over tips is not yet known.
+ */
+int cascade_pseudo_merges(const struct pseudo_merge_map *pm,
+			  struct bitmap *result,
+			  struct bitmap *roots);
+
 #endif