[v3,07/11] Add a function to determine unique prefixes for a list of strings
diff mbox series

Message ID 3000d7d08dfb64511b4ebf9d05617897dd7252f7.1563289115.git.gitgitgadget@gmail.com
State New
Headers show
Series
  • git add -i: add a rudimentary version in C (supporting only status and help so far)
Related show

Commit Message

Johannes Schindelin via GitGitGadget July 16, 2019, 2:58 p.m. UTC
From: Slavica Djukic <slawica92@hotmail.com>

In the `git add -i` command, we show unique prefixes of the commands and
files, to give an indication what prefix would select them.

Naturally, the C implementation looks a lot different than the Perl
implementation: in Perl, a trie is much easier implemented, while we
already have a pretty neat hashmap implementation in C that we use for
the purpose of storing (not necessarily unique) prefixes.

The idea: for each item that we add, we generate prefixes starting with
the first letter, then the first two letters, then three, etc, until we
find a prefix that is unique (or until the prefix length would be
longer than we want). If we encounter a previously-unique prefix on the
way, we adjust that item's prefix to make it unique again (or we mark it
as having no unique prefix if we failed to find one). These partial
prefixes are stored in a hash map (for quick lookup times).

To make sure that this function works as expected, we add a test using a
special-purpose test helper that was added for that purpose.

Note: We expect the list of prefix items to be passed in as a list of
pointers rather than as regular list to avoid having to copy information
(the actual items will most likely contain more information than just
the name and the length of the unique prefix, but passing in `struct
prefix_item *` would not allow for that).

Signed-off-by: Slavica Djukic <slawica92@hotmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 Makefile                   |   2 +
 prefix-map.c               | 109 +++++++++++++++++++++++++++++++++++++
 prefix-map.h               |  40 ++++++++++++++
 t/helper/test-prefix-map.c |  58 ++++++++++++++++++++
 t/helper/test-tool.c       |   1 +
 t/helper/test-tool.h       |   1 +
 t/t0018-prefix-map.sh      |  10 ++++
 7 files changed, 221 insertions(+)
 create mode 100644 prefix-map.c
 create mode 100644 prefix-map.h
 create mode 100644 t/helper/test-prefix-map.c
 create mode 100755 t/t0018-prefix-map.sh

Comments

Junio C Hamano July 31, 2019, 6:39 p.m. UTC | #1
"Slavica Djukic via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The idea: for each item that we add, we generate prefixes starting with
> the first letter, then the first two letters, then three, etc, until we
> find a prefix that is unique (or until the prefix length would be
> longer than we want). If we encounter a previously-unique prefix on the
> way, we adjust that item's prefix to make it unique again (or we mark it
> as having no unique prefix if we failed to find one). These partial
> prefixes are stored in a hash map (for quick lookup times).

OK.  I suppose such a machinery that accepts a set of strings and
then give us back a set of unique prefix length for each element of
the set would be useful in general, even outside the "and choose"
part of "list and choose".  Nice design.

> To make sure that this function works as expected, we add a test using a
> special-purpose test helper that was added for that purpose.

Somehow the above repeatedly says how special purpose it is
redundantly ;-)

> diff --git a/prefix-map.h b/prefix-map.h
> new file mode 100644
> index 0000000000..ce3b8a4a32
> --- /dev/null
> +++ b/prefix-map.h
> @@ -0,0 +1,40 @@
> +#ifndef PREFIX_MAP_H
> +#define PREFIX_MAP_H
> +
> +#include "hashmap.h"
> +
> +struct prefix_item {
> +	const char *name;
> +	size_t prefix_length;
> +};
> +
> +struct prefix_map_entry {
> +	struct hashmap_entry e;
> +	const char *name;
> +	size_t prefix_length;
> +	/* if item is NULL, the prefix is not unique */
> +	struct prefix_item *item;
> +};
> +
> +struct prefix_map {
> +	struct hashmap map;
> +	int min_length, max_length;
> +};
> +
> +/*
> + * Find unique prefixes in a given list of strings.
> + *
> + * Typically, the `struct prefix_item` information will be but a field in the
> + * actual item struct; For this reason, the `list` parameter is specified as a
> + * list of pointers to the items.
> + *
> + * The `min_length`/`max_length` parameters define what length the unique
> + * prefixes should have.
> + *
> + * If no unique prefix could be found for a given item, its `prefix_length`
> + * will be set to 0.
> + */
> +void find_unique_prefixes(struct prefix_item **list, size_t nr,
> +			  int min_length, int max_length);
> +
> +#endif

Looks like a quite sane interface to me.

Thanks.
SZEDER Gábor Aug. 24, 2019, 12:38 p.m. UTC | #2
On Tue, Jul 16, 2019 at 07:58:42AM -0700, Slavica Djukic via GitGitGadget wrote:
> In the `git add -i` command, we show unique prefixes of the commands and
> files, to give an indication what prefix would select them.
> 
> Naturally, the C implementation looks a lot different than the Perl
> implementation: in Perl, a trie is much easier implemented, while we
> already have a pretty neat hashmap implementation in C that we use for
> the purpose of storing (not necessarily unique) prefixes.
> 
> The idea: for each item that we add, we generate prefixes starting with
> the first letter, then the first two letters, then three, etc, until we
> find a prefix that is unique (or until the prefix length would be
> longer than we want). If we encounter a previously-unique prefix on the
> way, we adjust that item's prefix to make it unique again (or we mark it
> as having no unique prefix if we failed to find one). These partial
> prefixes are stored in a hash map (for quick lookup times).
> 
> To make sure that this function works as expected, we add a test using a
> special-purpose test helper that was added for that purpose.
> 
> Note: We expect the list of prefix items to be passed in as a list of
> pointers rather than as regular list to avoid having to copy information
> (the actual items will most likely contain more information than just
> the name and the length of the unique prefix, but passing in `struct
> prefix_item *` would not allow for that).

> diff --git a/prefix-map.c b/prefix-map.c
> new file mode 100644
> index 0000000000..747ddb4ebc
> --- /dev/null
> +++ b/prefix-map.c
> @@ -0,0 +1,109 @@
> +#include "cache.h"
> +#include "prefix-map.h"
> +
> +static int map_cmp(const void *unused_cmp_data,
> +		   const void *entry,
> +		   const void *entry_or_key,
> +		   const void *unused_keydata)
> +{
> +	const struct prefix_map_entry *a = entry;
> +	const struct prefix_map_entry *b = entry_or_key;
> +
> +	return a->prefix_length != b->prefix_length ||
> +		strncmp(a->name, b->name, a->prefix_length);
> +}
> +
> +static void add_prefix_entry(struct hashmap *map, const char *name,
> +			     size_t prefix_length, struct prefix_item *item)
> +{
> +	struct prefix_map_entry *result = xmalloc(sizeof(*result));
> +	result->name = name;
> +	result->prefix_length = prefix_length;
> +	result->item = item;
> +	hashmap_entry_init(result, memhash(name, prefix_length));
> +	hashmap_add(map, result);
> +}
> +
> +static void init_prefix_map(struct prefix_map *prefix_map,
> +			    int min_prefix_length, int max_prefix_length)
> +{
> +	hashmap_init(&prefix_map->map, map_cmp, NULL, 0);
> +	prefix_map->min_length = min_prefix_length;
> +	prefix_map->max_length = max_prefix_length;
> +}
> +
> +static void add_prefix_item(struct prefix_map *prefix_map,
> +			    struct prefix_item *item)
> +{
> +	struct prefix_map_entry e = { { NULL } }, *e2;
> +	int j;
> +
> +	e.item = item;
> +	e.name = item->name;
> +
> +	for (j = prefix_map->min_length;
> +	     j <= prefix_map->max_length && e.name[j]; j++) {
> +		/* Avoid breaking UTF-8 multi-byte sequences */
> +		if (!isascii(e.name[j]))
> +			break;
> +
> +		e.prefix_length = j;
> +		hashmap_entry_init(&e, memhash(e.name, j));
> +		e2 = hashmap_get(&prefix_map->map, &e, NULL);
> +		if (!e2) {
> +			/* prefix is unique at this stage */
> +			item->prefix_length = j;
> +			add_prefix_entry(&prefix_map->map, e.name, j, item);
> +			break;
> +		}
> +
> +		if (!e2->item)
> +			continue; /* non-unique prefix */
> +
> +		if (j != e2->item->prefix_length || memcmp(e.name, e2->name, j))
> +			BUG("unexpected prefix length: %d != %d (%s != %s)",
> +			    j, (int)e2->item->prefix_length, e.name, e2->name);
> +
> +		/* skip common prefix */
> +		for (; j < prefix_map->max_length && e.name[j]; j++) {
> +			if (e.item->name[j] != e2->item->name[j])
> +				break;
> +			add_prefix_entry(&prefix_map->map, e.name, j + 1,
> +					 NULL);
> +		}
> +
> +		/* e2 no longer refers to a unique prefix */
> +		if (j < prefix_map->max_length && e2->name[j]) {
> +			/* found a new unique prefix for e2's item */
> +			e2->item->prefix_length = j + 1;
> +			add_prefix_entry(&prefix_map->map, e2->name, j + 1,
> +					 e2->item);
> +		}
> +		else
> +			e2->item->prefix_length = 0;
> +		e2->item = NULL;
> +
> +		if (j < prefix_map->max_length && e.name[j]) {
> +			/* found a unique prefix for the item */
> +			e.item->prefix_length = j + 1;
> +			add_prefix_entry(&prefix_map->map, e.name, j + 1,
> +					 e.item);
> +		} else
> +			/* item has no (short enough) unique prefix */
> +			e.item->prefix_length = 0;
> +
> +		break;
> +	}
> +}
> +
> +void find_unique_prefixes(struct prefix_item **list, size_t nr,
> +			  int min_length, int max_length)
> +{
> +	int i;
> +	struct prefix_map prefix_map;
> +
> +	init_prefix_map(&prefix_map, min_length, max_length);
> +	for (i = 0; i < nr; i++)
> +		add_prefix_item(&prefix_map, list[i]);
> +	hashmap_free(&prefix_map.map, 1);
> +}

Between the commit message, the in-code comment, the names of the new
files, and implementation I was left somewhat confused about what this
is about and how it works.  TBH, I didn't even try to understand how
all the above works, in particular the add_prefix_item() function.

However, I think it would be much-much simpler to first sort (a copy
of?) the array of prefix item pointers based on their 'name' field,
and then look for a unique prefix in each neighboring pair.  Perhaps
it would even be faster, because it doesn't have to allocate a bunch
of hashmap items, though I don't think that it matters much in
practice (i.e. I expect the number of items to be fairly small;
presumably nobody will run interactive add after a mass refactoring
modifying thousands of files).

> diff --git a/prefix-map.h b/prefix-map.h
> new file mode 100644
> index 0000000000..ce3b8a4a32
> --- /dev/null
> +++ b/prefix-map.h
> @@ -0,0 +1,40 @@
> +#ifndef PREFIX_MAP_H
> +#define PREFIX_MAP_H
>
> +#include "hashmap.h"
> +
> +struct prefix_item {
> +	const char *name;
> +	size_t prefix_length;
> +};

This struct is part of find_unique_prefixes()'s signature, good.

> +struct prefix_map_entry {
> +	struct hashmap_entry e;
> +	const char *name;
> +	size_t prefix_length;
> +	/* if item is NULL, the prefix is not unique */
> +	struct prefix_item *item;
> +};
> +
> +struct prefix_map {
> +	struct hashmap map;
> +	int min_length, max_length;
> +};

However, neither of these two structs nor the hashmap appear in the
function's signature, but are all implementation details.  Therefore,
they should not be defined and included here in the header but in the
.c source file.  (But as mentioned above, I think this could be
implemented much simpler without these data structures.)

Furthermore, this is not a map.
A map, in general, is a container of key-value pairs that allows
efficient insertion, removal and lookup.  This so-called prefix_map
does none of that, so it should not be called a map.

> +/*
> + * Find unique prefixes in a given list of strings.

... and stores the length of the unique prefixes in the
'prefix_length' field of the elements of the given array.

> + *
> + * Typically, the `struct prefix_item` information will be but a field in the

s/but //, perhaps?

> + * actual item struct; For this reason, the `list` parameter is specified as a
> + * list of pointers to the items.
> + *
> + * The `min_length`/`max_length` parameters define what length the unique
> + * prefixes should have.
> + *
> + * If no unique prefix could be found for a given item, its `prefix_length`
> + * will be set to 0.
> + */
> +void find_unique_prefixes(struct prefix_item **list, size_t nr,

The first argument is not a list but an array.

> +			  int min_length, int max_length);

size_t, perhaps?  These are closely related to
'prefix_item.prefix_length', which is already (rightfully) size_t.

> +
> +#endif
> diff --git a/t/helper/test-prefix-map.c b/t/helper/test-prefix-map.c
> new file mode 100644
> index 0000000000..3f1c90eaf0
> --- /dev/null
> +++ b/t/helper/test-prefix-map.c
> @@ -0,0 +1,58 @@
> +#include "test-tool.h"
> +#include "cache.h"
> +#include "prefix-map.h"
> +
> +static size_t test_count, failed_count;
> +
> +static void check(int succeeded, const char *file, size_t line_no,
> +		  const char *fmt, ...)
> +{
> +	va_list ap;
> +
> +	test_count++;
> +	if (succeeded)
> +		return;
> +
> +	va_start(ap, fmt);
> +	fprintf(stderr, "%s:%d: ", file, (int)line_no);
> +	vfprintf(stderr, fmt, ap);
> +	fputc('\n', stderr);
> +	va_end(ap);
> +
> +	failed_count++;
> +}
> +
> +#define EXPECT_SIZE_T_EQUALS(expect, actual, hint) \
> +	check(expect == actual, __FILE__, __LINE__, \
> +	      "size_t's do not match: %" \
> +	      PRIdMAX " != %" PRIdMAX " (%s) (%s)", \
> +	      (intmax_t)expect, (intmax_t)actual, #actual, hint)
> +
> +int cmd__prefix_map(int argc, const char **argv)
> +{
> +#define NR 5
> +	struct prefix_item items[NR] = {

You don't have to tell the compiler how many elements this array will
contain, it will figure that out on its own.

> +		{ "unique" },
> +		{ "hell" },
> +		{ "hello" },
> +		{ "wok" },
> +		{ "world" },
> +	};
> +	struct prefix_item *list[NR] = {

Likewise.

> +		items, items + 1, items + 2, items + 3, items + 4
> +	};
> +
> +	find_unique_prefixes(list, NR, 1, 3);

This could be find_unique_prefixes(list, ARRAY_SIZE(list), 1, 3), and
then there is no need for that NR macro anymore.

> +
> +#define EXPECT_PREFIX_LENGTH_EQUALS(expect, index) \
> +	EXPECT_SIZE_T_EQUALS(expect, list[index]->prefix_length, \
> +			     list[index]->name)
> +
> +	EXPECT_PREFIX_LENGTH_EQUALS(1, 0);
> +	EXPECT_PREFIX_LENGTH_EQUALS(0, 1);
> +	EXPECT_PREFIX_LENGTH_EQUALS(0, 2);
> +	EXPECT_PREFIX_LENGTH_EQUALS(3, 3);
> +	EXPECT_PREFIX_LENGTH_EQUALS(3, 4);
> +
> +	return !!failed_count;
> +}
Johannes Schindelin Aug. 27, 2019, 12:14 p.m. UTC | #3
Hi Gábor,

On Sat, 24 Aug 2019, SZEDER Gábor wrote:

> On Tue, Jul 16, 2019 at 07:58:42AM -0700, Slavica Djukic via GitGitGadget wrote:
> > In the `git add -i` command, we show unique prefixes of the commands and
> > files, to give an indication what prefix would select them.
> >
> > Naturally, the C implementation looks a lot different than the Perl
> > implementation: in Perl, a trie is much easier implemented, while we
> > already have a pretty neat hashmap implementation in C that we use for
> > the purpose of storing (not necessarily unique) prefixes.
> >
> > The idea: for each item that we add, we generate prefixes starting with
> > the first letter, then the first two letters, then three, etc, until we
> > find a prefix that is unique (or until the prefix length would be
> > longer than we want). If we encounter a previously-unique prefix on the
> > way, we adjust that item's prefix to make it unique again (or we mark it
> > as having no unique prefix if we failed to find one). These partial
> > prefixes are stored in a hash map (for quick lookup times).
> >
> > To make sure that this function works as expected, we add a test using a
> > special-purpose test helper that was added for that purpose.
> >
> > Note: We expect the list of prefix items to be passed in as a list of
> > pointers rather than as regular list to avoid having to copy information
> > (the actual items will most likely contain more information than just
> > the name and the length of the unique prefix, but passing in `struct
> > prefix_item *` would not allow for that).
>
> > diff --git a/prefix-map.c b/prefix-map.c
> > new file mode 100644
> > index 0000000000..747ddb4ebc
> > --- /dev/null
> > +++ b/prefix-map.c
> > @@ -0,0 +1,109 @@
> > +#include "cache.h"
> > +#include "prefix-map.h"
> > +
> > +static int map_cmp(const void *unused_cmp_data,
> > +		   const void *entry,
> > +		   const void *entry_or_key,
> > +		   const void *unused_keydata)
> > +{
> > +	const struct prefix_map_entry *a = entry;
> > +	const struct prefix_map_entry *b = entry_or_key;
> > +
> > +	return a->prefix_length != b->prefix_length ||
> > +		strncmp(a->name, b->name, a->prefix_length);
> > +}
> > +
> > +static void add_prefix_entry(struct hashmap *map, const char *name,
> > +			     size_t prefix_length, struct prefix_item *item)
> > +{
> > +	struct prefix_map_entry *result = xmalloc(sizeof(*result));
> > +	result->name = name;
> > +	result->prefix_length = prefix_length;
> > +	result->item = item;
> > +	hashmap_entry_init(result, memhash(name, prefix_length));
> > +	hashmap_add(map, result);
> > +}
> > +
> > +static void init_prefix_map(struct prefix_map *prefix_map,
> > +			    int min_prefix_length, int max_prefix_length)
> > +{
> > +	hashmap_init(&prefix_map->map, map_cmp, NULL, 0);
> > +	prefix_map->min_length = min_prefix_length;
> > +	prefix_map->max_length = max_prefix_length;
> > +}
> > +
> > +static void add_prefix_item(struct prefix_map *prefix_map,
> > +			    struct prefix_item *item)
> > +{
> > +	struct prefix_map_entry e = { { NULL } }, *e2;
> > +	int j;
> > +
> > +	e.item = item;
> > +	e.name = item->name;
> > +
> > +	for (j = prefix_map->min_length;
> > +	     j <= prefix_map->max_length && e.name[j]; j++) {
> > +		/* Avoid breaking UTF-8 multi-byte sequences */
> > +		if (!isascii(e.name[j]))
> > +			break;
> > +
> > +		e.prefix_length = j;
> > +		hashmap_entry_init(&e, memhash(e.name, j));
> > +		e2 = hashmap_get(&prefix_map->map, &e, NULL);
> > +		if (!e2) {
> > +			/* prefix is unique at this stage */
> > +			item->prefix_length = j;
> > +			add_prefix_entry(&prefix_map->map, e.name, j, item);
> > +			break;
> > +		}
> > +
> > +		if (!e2->item)
> > +			continue; /* non-unique prefix */
> > +
> > +		if (j != e2->item->prefix_length || memcmp(e.name, e2->name, j))
> > +			BUG("unexpected prefix length: %d != %d (%s != %s)",
> > +			    j, (int)e2->item->prefix_length, e.name, e2->name);
> > +
> > +		/* skip common prefix */
> > +		for (; j < prefix_map->max_length && e.name[j]; j++) {
> > +			if (e.item->name[j] != e2->item->name[j])
> > +				break;
> > +			add_prefix_entry(&prefix_map->map, e.name, j + 1,
> > +					 NULL);
> > +		}
> > +
> > +		/* e2 no longer refers to a unique prefix */
> > +		if (j < prefix_map->max_length && e2->name[j]) {
> > +			/* found a new unique prefix for e2's item */
> > +			e2->item->prefix_length = j + 1;
> > +			add_prefix_entry(&prefix_map->map, e2->name, j + 1,
> > +					 e2->item);
> > +		}
> > +		else
> > +			e2->item->prefix_length = 0;
> > +		e2->item = NULL;
> > +
> > +		if (j < prefix_map->max_length && e.name[j]) {
> > +			/* found a unique prefix for the item */
> > +			e.item->prefix_length = j + 1;
> > +			add_prefix_entry(&prefix_map->map, e.name, j + 1,
> > +					 e.item);
> > +		} else
> > +			/* item has no (short enough) unique prefix */
> > +			e.item->prefix_length = 0;
> > +
> > +		break;
> > +	}
> > +}
> > +
> > +void find_unique_prefixes(struct prefix_item **list, size_t nr,
> > +			  int min_length, int max_length)
> > +{
> > +	int i;
> > +	struct prefix_map prefix_map;
> > +
> > +	init_prefix_map(&prefix_map, min_length, max_length);
> > +	for (i = 0; i < nr; i++)
> > +		add_prefix_item(&prefix_map, list[i]);
> > +	hashmap_free(&prefix_map.map, 1);
> > +}
>
> Between the commit message, the in-code comment, the names of the new
> files, and implementation I was left somewhat confused about what this
> is about and how it works.  TBH, I didn't even try to understand how
> all the above works, in particular the add_prefix_item() function.

Let me try to explain it here, and maybe you can help me by suggesting
an improved commit message and/or code comments?

The problem is this: given a set of items with labels (e.g. file names),
find, for each item, the unique prefix that identifies it. Example:
given the files `hello.txt`, `heaven.txt` and `hell.txt`, the items'
unique prefixes would be `hello`, `hea` and `hell.`, respectively.

In `git add -i`, we actually only want to allow alphanumerical prefixes,
and we also want at least one, and at most three characters, so only the
second item would have an admissible unique prefix: `hea`.

> However, I think it would be much-much simpler to first sort (a copy
> of?) the array of prefix item pointers based on their 'name' field,
> and then look for a unique prefix in each neighboring pair.  Perhaps
> it would even be faster, because it doesn't have to allocate a bunch
> of hashmap items, though I don't think that it matters much in
> practice (i.e. I expect the number of items to be fairly small;
> presumably nobody will run interactive add after a mass refactoring
> modifying thousands of files).

The time complexity of the sorted list would be O(n*log(n)), while the
hashmap-based complexity would be an amortized O(n).

And yes, you would not _want_ to run interactive add after a mass
refactoring. But it happens. It happens to me more times than I care to
admit. And you know what? I really appreciate that even the Perl version
is relatively snappy in those circumstances.

> > diff --git a/prefix-map.h b/prefix-map.h
> > new file mode 100644
> > index 0000000000..ce3b8a4a32
> > --- /dev/null
> > +++ b/prefix-map.h
> > @@ -0,0 +1,40 @@
> > +#ifndef PREFIX_MAP_H
> > +#define PREFIX_MAP_H
> >
> > +#include "hashmap.h"
> > +
> > +struct prefix_item {
> > +	const char *name;
> > +	size_t prefix_length;
> > +};
>
> This struct is part of find_unique_prefixes()'s signature, good.
>
> > +struct prefix_map_entry {
> > +	struct hashmap_entry e;
> > +	const char *name;
> > +	size_t prefix_length;
> > +	/* if item is NULL, the prefix is not unique */
> > +	struct prefix_item *item;
> > +};
> > +
> > +struct prefix_map {
> > +	struct hashmap map;
> > +	int min_length, max_length;
> > +};
>
> However, neither of these two structs nor the hashmap appear in the
> function's signature, but are all implementation details.  Therefore,
> they should not be defined and included here in the header but in the
> .c source file.  (But as mentioned above, I think this could be
> implemented much simpler without these data structures.)

Right you are!

> Furthermore, this is not a map.
> A map, in general, is a container of key-value pairs that allows
> efficient insertion, removal and lookup.  This so-called prefix_map
> does none of that, so it should not be called a map.

What would you call it instead?

(I went with "map" because the underlying data structure is a "hash
map", I know, not the best argument, but I failed to find a better
name...)

I also have to admit that I thought that I could fix the design where
`git-add--interactive.perl` creates this trie, but then still performs a
linear search when searching by prefix. That seems not to be possible,
though, as the unique prefixes are limited to certain character ranges,
while the lookup-by-prefix is not limited in that way.

> > +/*
> > + * Find unique prefixes in a given list of strings.
>
> ... and stores the length of the unique prefixes in the
> 'prefix_length' field of the elements of the given array.

Good idea. I changed it to also explain what is meant by "unique
prefix":

 * Given a list of names, find unique prefixes (i.e. the first <n> characters
 * that uniquely identify the names) and store the lengths of the unique
 * prefixes in the 'prefix_length' field of the elements of the given array..

> > + *
> > + * Typically, the `struct prefix_item` information will be but a field in the
>
> s/but //, perhaps?

Sure. I am relatively certain that it is correct grammar, but it is
probably a good idea to remove it.

> > + * actual item struct; For this reason, the `list` parameter is specified as a
> > + * list of pointers to the items.
> > + *
> > + * The `min_length`/`max_length` parameters define what length the unique
> > + * prefixes should have.
> > + *
> > + * If no unique prefix could be found for a given item, its `prefix_length`
> > + * will be set to 0.
> > + */
> > +void find_unique_prefixes(struct prefix_item **list, size_t nr,
>
> The first argument is not a list but an array.

Indeed.

> > +			  int min_length, int max_length);
>
> size_t, perhaps?  These are closely related to
> 'prefix_item.prefix_length', which is already (rightfully) size_t.

True.

> > +#endif
> > diff --git a/t/helper/test-prefix-map.c b/t/helper/test-prefix-map.c
> > new file mode 100644
> > index 0000000000..3f1c90eaf0
> > --- /dev/null
> > +++ b/t/helper/test-prefix-map.c
> > @@ -0,0 +1,58 @@
> > +#include "test-tool.h"
> > +#include "cache.h"
> > +#include "prefix-map.h"
> > +
> > +static size_t test_count, failed_count;
> > +
> > +static void check(int succeeded, const char *file, size_t line_no,
> > +		  const char *fmt, ...)
> > +{
> > +	va_list ap;
> > +
> > +	test_count++;
> > +	if (succeeded)
> > +		return;
> > +
> > +	va_start(ap, fmt);
> > +	fprintf(stderr, "%s:%d: ", file, (int)line_no);
> > +	vfprintf(stderr, fmt, ap);
> > +	fputc('\n', stderr);
> > +	va_end(ap);
> > +
> > +	failed_count++;
> > +}
> > +
> > +#define EXPECT_SIZE_T_EQUALS(expect, actual, hint) \
> > +	check(expect == actual, __FILE__, __LINE__, \
> > +	      "size_t's do not match: %" \
> > +	      PRIdMAX " != %" PRIdMAX " (%s) (%s)", \
> > +	      (intmax_t)expect, (intmax_t)actual, #actual, hint)
> > +
> > +int cmd__prefix_map(int argc, const char **argv)
> > +{
> > +#define NR 5
> > +	struct prefix_item items[NR] = {
>
> You don't have to tell the compiler how many elements this array will
> contain, it will figure that out on its own.
>
> > +		{ "unique" },
> > +		{ "hell" },
> > +		{ "hello" },
> > +		{ "wok" },
> > +		{ "world" },
> > +	};
> > +	struct prefix_item *list[NR] = {
>
> Likewise.

That is correct.

What the compiler _cannot_ figure out, on its own, however, is that
`items` and `list` _need_ to contain the same number of items.

Hence the need for `NR`.

> > +		items, items + 1, items + 2, items + 3, items + 4
> > +	};
> > +
> > +	find_unique_prefixes(list, NR, 1, 3);
>
> This could be find_unique_prefixes(list, ARRAY_SIZE(list), 1, 3), and
> then there is no need for that NR macro anymore.
>
> > +
> > +#define EXPECT_PREFIX_LENGTH_EQUALS(expect, index) \
> > +	EXPECT_SIZE_T_EQUALS(expect, list[index]->prefix_length, \
> > +			     list[index]->name)
> > +
> > +	EXPECT_PREFIX_LENGTH_EQUALS(1, 0);
> > +	EXPECT_PREFIX_LENGTH_EQUALS(0, 1);
> > +	EXPECT_PREFIX_LENGTH_EQUALS(0, 2);
> > +	EXPECT_PREFIX_LENGTH_EQUALS(3, 3);
> > +	EXPECT_PREFIX_LENGTH_EQUALS(3, 4);
> > +
> > +	return !!failed_count;
> > +}

Thank you for your review!
Dscho
SZEDER Gábor Aug. 28, 2019, 4:30 p.m. UTC | #4
On Tue, Aug 27, 2019 at 02:14:06PM +0200, Johannes Schindelin wrote:
> Hi Gábor,
> 
> On Sat, 24 Aug 2019, SZEDER Gábor wrote:
> 
> > On Tue, Jul 16, 2019 at 07:58:42AM -0700, Slavica Djukic via GitGitGadget wrote:
> > > In the `git add -i` command, we show unique prefixes of the commands and
> > > files, to give an indication what prefix would select them.
> > >
> > > Naturally, the C implementation looks a lot different than the Perl
> > > implementation: in Perl, a trie is much easier implemented, while we
> > > already have a pretty neat hashmap implementation in C that we use for
> > > the purpose of storing (not necessarily unique) prefixes.
> > >
> > > The idea: for each item that we add, we generate prefixes starting with
> > > the first letter, then the first two letters, then three, etc, until we
> > > find a prefix that is unique (or until the prefix length would be
> > > longer than we want). If we encounter a previously-unique prefix on the
> > > way, we adjust that item's prefix to make it unique again (or we mark it
> > > as having no unique prefix if we failed to find one). These partial
> > > prefixes are stored in a hash map (for quick lookup times).
> > >
> > > To make sure that this function works as expected, we add a test using a
> > > special-purpose test helper that was added for that purpose.
> > >
> > > Note: We expect the list of prefix items to be passed in as a list of
> > > pointers rather than as regular list to avoid having to copy information
> > > (the actual items will most likely contain more information than just
> > > the name and the length of the unique prefix, but passing in `struct
> > > prefix_item *` would not allow for that).
> >
> > > diff --git a/prefix-map.c b/prefix-map.c
> > > new file mode 100644
> > > index 0000000000..747ddb4ebc
> > > --- /dev/null
> > > +++ b/prefix-map.c
> > > @@ -0,0 +1,109 @@
> > > +#include "cache.h"
> > > +#include "prefix-map.h"
> > > +
> > > +static int map_cmp(const void *unused_cmp_data,
> > > +		   const void *entry,
> > > +		   const void *entry_or_key,
> > > +		   const void *unused_keydata)
> > > +{
> > > +	const struct prefix_map_entry *a = entry;
> > > +	const struct prefix_map_entry *b = entry_or_key;
> > > +
> > > +	return a->prefix_length != b->prefix_length ||
> > > +		strncmp(a->name, b->name, a->prefix_length);
> > > +}
> > > +
> > > +static void add_prefix_entry(struct hashmap *map, const char *name,
> > > +			     size_t prefix_length, struct prefix_item *item)
> > > +{
> > > +	struct prefix_map_entry *result = xmalloc(sizeof(*result));
> > > +	result->name = name;
> > > +	result->prefix_length = prefix_length;
> > > +	result->item = item;
> > > +	hashmap_entry_init(result, memhash(name, prefix_length));
> > > +	hashmap_add(map, result);
> > > +}
> > > +
> > > +static void init_prefix_map(struct prefix_map *prefix_map,
> > > +			    int min_prefix_length, int max_prefix_length)
> > > +{
> > > +	hashmap_init(&prefix_map->map, map_cmp, NULL, 0);
> > > +	prefix_map->min_length = min_prefix_length;
> > > +	prefix_map->max_length = max_prefix_length;
> > > +}
> > > +
> > > +static void add_prefix_item(struct prefix_map *prefix_map,
> > > +			    struct prefix_item *item)
> > > +{
> > > +	struct prefix_map_entry e = { { NULL } }, *e2;
> > > +	int j;
> > > +
> > > +	e.item = item;
> > > +	e.name = item->name;
> > > +
> > > +	for (j = prefix_map->min_length;
> > > +	     j <= prefix_map->max_length && e.name[j]; j++) {
> > > +		/* Avoid breaking UTF-8 multi-byte sequences */

Ok, that's a good idea, but...

> > > +		if (!isascii(e.name[j]))
> > > +			break;

This doesn't seem right.  It just breaks the loop, potentially leaving
'item->prefix_length' set from the previous iteration.  Try the two
strings "123" and "12ő", in this order: they both get prefix_length =
3, i.e. the prefix ends between the two bytes of the multi-byte
character.

Furthermore, neither the docstring of find_unique_prefixes() nor the
commit message mention what is supposed to happen with paths with
multi-byte characters.  I think at least one of them should.

> > > +
> > > +		e.prefix_length = j;
> > > +		hashmap_entry_init(&e, memhash(e.name, j));
> > > +		e2 = hashmap_get(&prefix_map->map, &e, NULL);
> > > +		if (!e2) {
> > > +			/* prefix is unique at this stage */
> > > +			item->prefix_length = j;
> > > +			add_prefix_entry(&prefix_map->map, e.name, j, item);
> > > +			break;
> > > +		}
> > > +
> > > +		if (!e2->item)
> > > +			continue; /* non-unique prefix */
> > > +
> > > +		if (j != e2->item->prefix_length || memcmp(e.name, e2->name, j))
> > > +			BUG("unexpected prefix length: %d != %d (%s != %s)",
> > > +			    j, (int)e2->item->prefix_length, e.name, e2->name);
> > > +
> > > +		/* skip common prefix */
> > > +		for (; j < prefix_map->max_length && e.name[j]; j++) {
> > > +			if (e.item->name[j] != e2->item->name[j])
> > > +				break;
> > > +			add_prefix_entry(&prefix_map->map, e.name, j + 1,
> > > +					 NULL);
> > > +		}
> > > +
> > > +		/* e2 no longer refers to a unique prefix */
> > > +		if (j < prefix_map->max_length && e2->name[j]) {
> > > +			/* found a new unique prefix for e2's item */
> > > +			e2->item->prefix_length = j + 1;
> > > +			add_prefix_entry(&prefix_map->map, e2->name, j + 1,
> > > +					 e2->item);
> > > +		}
> > > +		else
> > > +			e2->item->prefix_length = 0;
> > > +		e2->item = NULL;
> > > +
> > > +		if (j < prefix_map->max_length && e.name[j]) {
> > > +			/* found a unique prefix for the item */
> > > +			e.item->prefix_length = j + 1;
> > > +			add_prefix_entry(&prefix_map->map, e.name, j + 1,
> > > +					 e.item);
> > > +		} else
> > > +			/* item has no (short enough) unique prefix */
> > > +			e.item->prefix_length = 0;
> > > +
> > > +		break;
> > > +	}
> > > +}
> > > +
> > > +void find_unique_prefixes(struct prefix_item **list, size_t nr,
> > > +			  int min_length, int max_length)
> > > +{
> > > +	int i;
> > > +	struct prefix_map prefix_map;
> > > +
> > > +	init_prefix_map(&prefix_map, min_length, max_length);
> > > +	for (i = 0; i < nr; i++)
> > > +		add_prefix_item(&prefix_map, list[i]);
> > > +	hashmap_free(&prefix_map.map, 1);
> > > +}
> >
> > Between the commit message, the in-code comment, the names of the new
> > files, and implementation I was left somewhat confused about what this
> > is about and how it works.  TBH, I didn't even try to understand how
> > all the above works, in particular the add_prefix_item() function.
> 
> Let me try to explain it here, and maybe you can help me by suggesting
> an improved commit message and/or code comments?
> 
> The problem is this: given a set of items with labels (e.g. file names),
> find, for each item, the unique prefix that identifies it. Example:
> given the files `hello.txt`, `heaven.txt` and `hell.txt`, the items'
> unique prefixes would be `hello`, `hea` and `hell.`, respectively.
> 
> In `git add -i`, we actually only want to allow alphanumerical prefixes,
> and we also want at least one, and at most three characters, so only the
> second item would have an admissible unique prefix: `hea`.

You say "at most three characters", but I notive that the call added
to 'add-interactive.c' in a later commit specifies 4.

I wonder what's the reason for the minimum prefix length, though.  I
mean, you can't really have a unique prefix shorter than one
character...

> > However, I think it would be much-much simpler to first sort (a copy
> > of?) the array of prefix item pointers based on their 'name' field,
> > and then look for a unique prefix in each neighboring pair.  Perhaps
> > it would even be faster, because it doesn't have to allocate a bunch
> > of hashmap items, though I don't think that it matters much in
> > practice (i.e. I expect the number of items to be fairly small;
> > presumably nobody will run interactive add after a mass refactoring
> > modifying thousands of files).
> 
> The time complexity of the sorted list would be O(n*log(n)), while the
> hashmap-based complexity would be an amortized O(n).

Oh, so you don't know how/why it works, either!? ;)

This hashmap-based implementation calls hashmap_get() approx.
n * (max_length - min_length) times.

Furthermore, depending on the strings and 'max_length', it can call
add_prefix_entry() more than n times.  E.g. the six strings "a1",
"a2", "ab1", "ab2", "abc1", "abc2", in this order, and max_length=4
result in 9 add_prefix_entry() calls.  And each add_prefix_entry()
call involves a memory allocation.

Ok, that's theory, let's see practice.

For a (very) quick'n'dirty performance test I replaced those five
items in 't/helper/test-prefix-map.c' with the output of:

  git -C ..../webkit.git/ ls-tree -r --name-only HEAD | sed -e 's/^/{ "/; s/$/" },/'

That's 280734 real paths.  Then a simple 'time test-tool prefix-map'
with the hashmap-based implementation ran for 0.055s, while using my
PoC sort-based algorithm it ran for 0.063s.  So while the
hashmap-based implementation is indeed faster, they are both well
below 0.1s, so in the context of 'git add -i' the performance
difference between the two approaches doesn't really matter.  

If we look beyond 'git add -i's requirements, and specify larger
max_length values, the hashmap-based implementation gets progressively
slower.  Already at max_length=5 it's a tad slower than the sort-based
implementation, and at 9 it takes twice as long.

The current hashmap-based implementation is ~50% more lines than my
sort-based PoC.  I'll send it out in a minute.

> And yes, you would not _want_ to run interactive add after a mass
> refactoring. But it happens. It happens to me more times than I care to
> admit. And you know what? I really appreciate that even the Perl version
> is relatively snappy in those circumstances.
> 
> > > diff --git a/prefix-map.h b/prefix-map.h
> > > new file mode 100644
> > > index 0000000000..ce3b8a4a32
> > > --- /dev/null
> > > +++ b/prefix-map.h
> > > @@ -0,0 +1,40 @@
> > > +#ifndef PREFIX_MAP_H
> > > +#define PREFIX_MAP_H
> > >
> > > +#include "hashmap.h"
> > > +
> > > +struct prefix_item {
> > > +	const char *name;
> > > +	size_t prefix_length;
> > > +};
> >
> > This struct is part of find_unique_prefixes()'s signature, good.
> >
> > > +struct prefix_map_entry {
> > > +	struct hashmap_entry e;
> > > +	const char *name;
> > > +	size_t prefix_length;
> > > +	/* if item is NULL, the prefix is not unique */
> > > +	struct prefix_item *item;
> > > +};
> > > +
> > > +struct prefix_map {
> > > +	struct hashmap map;
> > > +	int min_length, max_length;
> > > +};
> >
> > However, neither of these two structs nor the hashmap appear in the
> > function's signature, but are all implementation details.  Therefore,
> > they should not be defined and included here in the header but in the
> > .c source file.  (But as mentioned above, I think this could be
> > implemented much simpler without these data structures.)
> 
> Right you are!
> 
> > Furthermore, this is not a map.
> > A map, in general, is a container of key-value pairs that allows
> > efficient insertion, removal and lookup.  This so-called prefix_map
> > does none of that, so it should not be called a map.
> 
> What would you call it instead?

Well, I would rather get rid of it in the first place :)

In my PoC I called the files 'unique-prefix.{c,h}'.

> > > +/*
> > > + * Find unique prefixes in a given list of strings.
> >
> > ... and stores the length of the unique prefixes in the
> > 'prefix_length' field of the elements of the given array.
> 
> Good idea. I changed it to also explain what is meant by "unique
> prefix":
> 
>  * Given a list of names, find unique prefixes (i.e. the first <n> characters
>  * that uniquely identify the names) and store the lengths of the unique
>  * prefixes in the 'prefix_length' field of the elements of the given array..

Sounds good.

> > > + *
> > > + * Typically, the `struct prefix_item` information will be but a field in the

> > > + * actual item struct; For this reason, the `list` parameter is specified as a
> > > + * list of pointers to the items.
> > > + *
> > > + * The `min_length`/`max_length` parameters define what length the unique
> > > + * prefixes should have.
> > > + *
> > > + * If no unique prefix could be found for a given item, its `prefix_length`
> > > + * will be set to 0.

I ran into cases where the hashmap-based implementation didn't set
'prefix_length' to 0 when it, in my opinion, should have, but left it
untouched.  Just try any string shorter than 'min_length'.

May or may not be related: try "ab", "ab", "abc", in this order: both
"ab" get their 'prefix_length' set to 0, as they should, but the
'prefix_length' of "abc" is left untouched, although it should be set
to 3.

Hmm, now that I've mentioned "in this order" three times already:
another benefit of the sort-based approach is that the order of
elements doesn't matter.

> > > + */
> > > +void find_unique_prefixes(struct prefix_item **list, size_t nr,
> >
> > The first argument is not a list but an array.
> 
> Indeed.
> 
> > > +			  int min_length, int max_length);
> >
> > size_t, perhaps?  These are closely related to
> > 'prefix_item.prefix_length', which is already (rightfully) size_t.
> 
> True.
> 
> > > +#endif
> > > diff --git a/t/helper/test-prefix-map.c b/t/helper/test-prefix-map.c
> > > new file mode 100644
> > > index 0000000000..3f1c90eaf0
> > > --- /dev/null
> > > +++ b/t/helper/test-prefix-map.c
> > > @@ -0,0 +1,58 @@
> > > +#include "test-tool.h"
> > > +#include "cache.h"
> > > +#include "prefix-map.h"
> > > +
> > > +static size_t test_count, failed_count;
> > > +
> > > +static void check(int succeeded, const char *file, size_t line_no,
> > > +		  const char *fmt, ...)
> > > +{
> > > +	va_list ap;
> > > +
> > > +	test_count++;
> > > +	if (succeeded)
> > > +		return;
> > > +
> > > +	va_start(ap, fmt);
> > > +	fprintf(stderr, "%s:%d: ", file, (int)line_no);
> > > +	vfprintf(stderr, fmt, ap);
> > > +	fputc('\n', stderr);
> > > +	va_end(ap);
> > > +
> > > +	failed_count++;
> > > +}
> > > +
> > > +#define EXPECT_SIZE_T_EQUALS(expect, actual, hint) \
> > > +	check(expect == actual, __FILE__, __LINE__, \
> > > +	      "size_t's do not match: %" \
> > > +	      PRIdMAX " != %" PRIdMAX " (%s) (%s)", \
> > > +	      (intmax_t)expect, (intmax_t)actual, #actual, hint)
> > > +
> > > +int cmd__prefix_map(int argc, const char **argv)
> > > +{
> > > +#define NR 5
> > > +	struct prefix_item items[NR] = {
> >
> > You don't have to tell the compiler how many elements this array will
> > contain, it will figure that out on its own.
> >
> > > +		{ "unique" },
> > > +		{ "hell" },
> > > +		{ "hello" },
> > > +		{ "wok" },
> > > +		{ "world" },
> > > +	};
> > > +	struct prefix_item *list[NR] = {
> >
> > Likewise.
> 
> That is correct.
> 
> What the compiler _cannot_ figure out, on its own, however, is that
> `items` and `list` _need_ to contain the same number of items.
> 
> Hence the need for `NR`.

OK, that's a good point.  I'd still like to get rid of that NR with
something like this instead:

  struct prefix_item items[] = { .... };
  struct prefix_item *list[ARRAY_SIZE(items)];
  for (i = 0; i < ARRAY_SIZE(items); i++) {
      list[i] = &items[i];
      list[i]->prefix_length = 12345; /* dummy */
  }
  find_unique_prefixes(list, ARRAY_SIZE(list), 0, 3);

Note the dummy value, to catch cases when 'prefix_length' is left
untouched.

This made this test tool more convenient to use, because I didn't have
to fiddle with NR and, worse, with those 'items, items+1, items+N'.

It's still a bit PITA (having to re-build each time I change the test
vector), though, compared to driving it from stdin and verifying it
through its stdout.

> > > +		items, items + 1, items + 2, items + 3, items + 4
> > > +	};
> > > +
> > > +	find_unique_prefixes(list, NR, 1, 3);
> >
> > This could be find_unique_prefixes(list, ARRAY_SIZE(list), 1, 3), and
> > then there is no need for that NR macro anymore.
> >
> > > +
> > > +#define EXPECT_PREFIX_LENGTH_EQUALS(expect, index) \
> > > +	EXPECT_SIZE_T_EQUALS(expect, list[index]->prefix_length, \
> > > +			     list[index]->name)
> > > +
> > > +	EXPECT_PREFIX_LENGTH_EQUALS(1, 0);
> > > +	EXPECT_PREFIX_LENGTH_EQUALS(0, 1);
> > > +	EXPECT_PREFIX_LENGTH_EQUALS(0, 2);
> > > +	EXPECT_PREFIX_LENGTH_EQUALS(3, 3);
> > > +	EXPECT_PREFIX_LENGTH_EQUALS(3, 4);
> > > +
> > > +	return !!failed_count;
> > > +}
> 
> Thank you for your review!
> Dscho

Patch
diff mbox series

diff --git a/Makefile b/Makefile
index 0a47200f47..545a0e8743 100644
--- a/Makefile
+++ b/Makefile
@@ -725,6 +725,7 @@  TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-path-utils.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
+TEST_BUILTINS_OBJS += test-prefix-map.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-reach.o
 TEST_BUILTINS_OBJS += test-read-cache.o
@@ -943,6 +944,7 @@  LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
 LIB_OBJS += pathspec.o
 LIB_OBJS += pkt-line.o
+LIB_OBJS += prefix-map.o
 LIB_OBJS += preload-index.o
 LIB_OBJS += pretty.o
 LIB_OBJS += prio-queue.o
diff --git a/prefix-map.c b/prefix-map.c
new file mode 100644
index 0000000000..747ddb4ebc
--- /dev/null
+++ b/prefix-map.c
@@ -0,0 +1,109 @@ 
+#include "cache.h"
+#include "prefix-map.h"
+
+static int map_cmp(const void *unused_cmp_data,
+		   const void *entry,
+		   const void *entry_or_key,
+		   const void *unused_keydata)
+{
+	const struct prefix_map_entry *a = entry;
+	const struct prefix_map_entry *b = entry_or_key;
+
+	return a->prefix_length != b->prefix_length ||
+		strncmp(a->name, b->name, a->prefix_length);
+}
+
+static void add_prefix_entry(struct hashmap *map, const char *name,
+			     size_t prefix_length, struct prefix_item *item)
+{
+	struct prefix_map_entry *result = xmalloc(sizeof(*result));
+	result->name = name;
+	result->prefix_length = prefix_length;
+	result->item = item;
+	hashmap_entry_init(result, memhash(name, prefix_length));
+	hashmap_add(map, result);
+}
+
+static void init_prefix_map(struct prefix_map *prefix_map,
+			    int min_prefix_length, int max_prefix_length)
+{
+	hashmap_init(&prefix_map->map, map_cmp, NULL, 0);
+	prefix_map->min_length = min_prefix_length;
+	prefix_map->max_length = max_prefix_length;
+}
+
+static void add_prefix_item(struct prefix_map *prefix_map,
+			    struct prefix_item *item)
+{
+	struct prefix_map_entry e = { { NULL } }, *e2;
+	int j;
+
+	e.item = item;
+	e.name = item->name;
+
+	for (j = prefix_map->min_length;
+	     j <= prefix_map->max_length && e.name[j]; j++) {
+		/* Avoid breaking UTF-8 multi-byte sequences */
+		if (!isascii(e.name[j]))
+			break;
+
+		e.prefix_length = j;
+		hashmap_entry_init(&e, memhash(e.name, j));
+		e2 = hashmap_get(&prefix_map->map, &e, NULL);
+		if (!e2) {
+			/* prefix is unique at this stage */
+			item->prefix_length = j;
+			add_prefix_entry(&prefix_map->map, e.name, j, item);
+			break;
+		}
+
+		if (!e2->item)
+			continue; /* non-unique prefix */
+
+		if (j != e2->item->prefix_length || memcmp(e.name, e2->name, j))
+			BUG("unexpected prefix length: %d != %d (%s != %s)",
+			    j, (int)e2->item->prefix_length, e.name, e2->name);
+
+		/* skip common prefix */
+		for (; j < prefix_map->max_length && e.name[j]; j++) {
+			if (e.item->name[j] != e2->item->name[j])
+				break;
+			add_prefix_entry(&prefix_map->map, e.name, j + 1,
+					 NULL);
+		}
+
+		/* e2 no longer refers to a unique prefix */
+		if (j < prefix_map->max_length && e2->name[j]) {
+			/* found a new unique prefix for e2's item */
+			e2->item->prefix_length = j + 1;
+			add_prefix_entry(&prefix_map->map, e2->name, j + 1,
+					 e2->item);
+		}
+		else
+			e2->item->prefix_length = 0;
+		e2->item = NULL;
+
+		if (j < prefix_map->max_length && e.name[j]) {
+			/* found a unique prefix for the item */
+			e.item->prefix_length = j + 1;
+			add_prefix_entry(&prefix_map->map, e.name, j + 1,
+					 e.item);
+		} else
+			/* item has no (short enough) unique prefix */
+			e.item->prefix_length = 0;
+
+		break;
+	}
+}
+
+void find_unique_prefixes(struct prefix_item **list, size_t nr,
+			  int min_length, int max_length)
+{
+	int i;
+	struct prefix_map prefix_map;
+
+	init_prefix_map(&prefix_map, min_length, max_length);
+	for (i = 0; i < nr; i++)
+		add_prefix_item(&prefix_map, list[i]);
+	hashmap_free(&prefix_map.map, 1);
+}
diff --git a/prefix-map.h b/prefix-map.h
new file mode 100644
index 0000000000..ce3b8a4a32
--- /dev/null
+++ b/prefix-map.h
@@ -0,0 +1,40 @@ 
+#ifndef PREFIX_MAP_H
+#define PREFIX_MAP_H
+
+#include "hashmap.h"
+
+struct prefix_item {
+	const char *name;
+	size_t prefix_length;
+};
+
+struct prefix_map_entry {
+	struct hashmap_entry e;
+	const char *name;
+	size_t prefix_length;
+	/* if item is NULL, the prefix is not unique */
+	struct prefix_item *item;
+};
+
+struct prefix_map {
+	struct hashmap map;
+	int min_length, max_length;
+};
+
+/*
+ * Find unique prefixes in a given list of strings.
+ *
+ * Typically, the `struct prefix_item` information will be but a field in the
+ * actual item struct; For this reason, the `list` parameter is specified as a
+ * list of pointers to the items.
+ *
+ * The `min_length`/`max_length` parameters define what length the unique
+ * prefixes should have.
+ *
+ * If no unique prefix could be found for a given item, its `prefix_length`
+ * will be set to 0.
+ */
+void find_unique_prefixes(struct prefix_item **list, size_t nr,
+			  int min_length, int max_length);
+
+#endif
diff --git a/t/helper/test-prefix-map.c b/t/helper/test-prefix-map.c
new file mode 100644
index 0000000000..3f1c90eaf0
--- /dev/null
+++ b/t/helper/test-prefix-map.c
@@ -0,0 +1,58 @@ 
+#include "test-tool.h"
+#include "cache.h"
+#include "prefix-map.h"
+
+static size_t test_count, failed_count;
+
+static void check(int succeeded, const char *file, size_t line_no,
+		  const char *fmt, ...)
+{
+	va_list ap;
+
+	test_count++;
+	if (succeeded)
+		return;
+
+	va_start(ap, fmt);
+	fprintf(stderr, "%s:%d: ", file, (int)line_no);
+	vfprintf(stderr, fmt, ap);
+	fputc('\n', stderr);
+	va_end(ap);
+
+	failed_count++;
+}
+
+#define EXPECT_SIZE_T_EQUALS(expect, actual, hint) \
+	check(expect == actual, __FILE__, __LINE__, \
+	      "size_t's do not match: %" \
+	      PRIdMAX " != %" PRIdMAX " (%s) (%s)", \
+	      (intmax_t)expect, (intmax_t)actual, #actual, hint)
+
+int cmd__prefix_map(int argc, const char **argv)
+{
+#define NR 5
+	struct prefix_item items[NR] = {
+		{ "unique" },
+		{ "hell" },
+		{ "hello" },
+		{ "wok" },
+		{ "world" },
+	};
+	struct prefix_item *list[NR] = {
+		items, items + 1, items + 2, items + 3, items + 4
+	};
+
+	find_unique_prefixes(list, NR, 1, 3);
+
+#define EXPECT_PREFIX_LENGTH_EQUALS(expect, index) \
+	EXPECT_SIZE_T_EQUALS(expect, list[index]->prefix_length, \
+			     list[index]->name)
+
+	EXPECT_PREFIX_LENGTH_EQUALS(1, 0);
+	EXPECT_PREFIX_LENGTH_EQUALS(0, 1);
+	EXPECT_PREFIX_LENGTH_EQUALS(0, 2);
+	EXPECT_PREFIX_LENGTH_EQUALS(3, 3);
+	EXPECT_PREFIX_LENGTH_EQUALS(3, 4);
+
+	return !!failed_count;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 087a8c0cc9..e076771b96 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -39,6 +39,7 @@  static struct test_cmd cmds[] = {
 	{ "parse-options", cmd__parse_options },
 	{ "path-utils", cmd__path_utils },
 	{ "pkt-line", cmd__pkt_line },
+	{ "prefix-map", cmd__prefix_map },
 	{ "prio-queue", cmd__prio_queue },
 	{ "reach", cmd__reach },
 	{ "read-cache", cmd__read_cache },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 7e703f3038..891bac703b 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -29,6 +29,7 @@  int cmd__online_cpus(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
 int cmd__pkt_line(int argc, const char **argv);
+int cmd__prefix_map(int argc, const char **argv);
 int cmd__prio_queue(int argc, const char **argv);
 int cmd__reach(int argc, const char **argv);
 int cmd__read_cache(int argc, const char **argv);
diff --git a/t/t0018-prefix-map.sh b/t/t0018-prefix-map.sh
new file mode 100755
index 0000000000..187fa92aec
--- /dev/null
+++ b/t/t0018-prefix-map.sh
@@ -0,0 +1,10 @@ 
+#!/bin/sh
+
+test_description='basic tests for prefix map'
+. ./test-lib.sh
+
+test_expect_success 'prefix map' '
+	test-tool prefix-map
+'
+
+test_done