[6/9] oid-array: provide a for-loop iterator

Message ID	X8qFo+GJJTbaPV58@coredump.intra.peff.net (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Date: Fri, 4 Dec 2020 13:53:23 -0500 From: Jeff King <peff@peff.net> To: git@vger.kernel.org Cc: Derrick Stolee <dstolee@microsoft.com> Subject: [PATCH 6/9] oid-array: provide a for-loop iterator Message-ID: <X8qFo+GJJTbaPV58@coredump.intra.peff.net> References: <X8qEg/KiAQDugPC0@coredump.intra.peff.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <X8qEg/KiAQDugPC0@coredump.intra.peff.net> Precedence: bulk
Series	misc commit-graph and oid-array cleanups \| expand [0/9] misc commit-graph and oid-array cleanups [1/9] oid-array.h: drop sha1 mention from header guard [2/9] t0064: drop sha1 mention from filename [3/9] t0064: make duplicate tests more robust [4/9] cache.h: move hash/oid functions to hash.h [5/9] oid-array: make sort function public [6/9] oid-array: provide a for-loop iterator [7/9] commit-graph: drop count_distinct_commits() function [8/9] commit-graph: replace packed_oid_list with oid_array [9/9] commit-graph: use size_t for array allocation and indexing

Jeff King Dec. 4, 2020, 6:53 p.m. UTC

We provide oid_array_for_each_unique() for iterating over the
de-duplicated items in an array. But it's awkward to use for two
reasons:

  1. It uses a callback, which means marshaling arguments into a struct
     and passing it to the callback with a void parameter.

  2. The callback doesn't know the numeric index of the oid we're
     looking at. This is useful for things like progress meters.

Iterating with a for-loop is much more natural for some cases, but the
caller has to do the de-duping itself. However, we can provide a small
helper to make this easier (see the docstring in the header for an
example use).

The caller does have to remember to sort the array first. We could add
an assertion into the helper that array->sorted is set, but I didn't
want to complicate what is otherwise a pretty fast code path.

I also considered adding a full iterator type with init/next/end
functions (similar to what we have for hashmaps). But it ended up making
the callers much harder to read. This version keeps us close to a basic
for-loop.

Yet another option would be adding an option to sort the array and
compact out the duplicates. This would mean iterating over the array an
extra time, though that's probably not a big deal (we did just do an
O(n log n) sort). But we'd still have to write a for-loop to iterate, so
it doesn't really make anything easier for the caller.

No new test, since we'll convert the callback iterator (which is covered
by t0064, among other callers) to use the new code.

Signed-off-by: Jeff King <peff@peff.net>
---
 oid-array.c |  7 ++-----
 oid-array.h | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+), 5 deletions(-)

Taylor Blau Dec. 4, 2020, 7:05 p.m. UTC | #1

On Fri, Dec 04, 2020 at 01:53:23PM -0500, Jeff King wrote:
> I also considered adding a full iterator type with init/next/end
> functions (similar to what we have for hashmaps). But it ended up making
> the callers much harder to read. This version keeps us close to a basic
> for-loop.

Yeah, I think that a full-blown iterator type is overkill for this
purpose. Another possible approach could be a macro:

  #define for_each_oid_array_unique(arr, i) \
    for (i = 0; (i) < (array)->nr; i = oid_array_for_each_unique((arr), i))

but I don't think that's making anything more clear than
'oid_array_for_each_unique' already is. So, I like the approach that you
took here.

> @@ -111,4 +113,24 @@ void oid_array_filter(struct oid_array *array,
>   */
>  void oid_array_sort(struct oid_array *array);
>
> +/**
> + * Find the next unique oid in the array after position "cur". You
> + * can use this to iterate over unique elements, like:
> + *
> + *   size_t i;
> + *   oid_array_sort(array);
> + *   for (i = 0; i < array->nr; i = oid_array_next_unique(array, i))
> + *	printf("%s", oid_to_hex(array->oids[i]);
> + *
> + * Non-unique iteration can just increment with "i++" to visit each element.
> + */
> +static inline size_t oid_array_next_unique(struct oid_array *array, size_t cur)
> +{
> +	do {
> +		cur++;
> +	} while (cur < array->nr &&
> +		 oideq(array->oid + cur, array->oid + cur - 1));

I don't love the pointer math here (would instead prefer
oideq(&array->oid[cur]) and so on), but I don't think that it matters
enough to make a difference.

I additionally had to make sure that cur - 1 >= 0 so that the second
argument would always be valid, but it is, since we call cur++.

You could check that cur++ doesn't overflow, but I think that that's
mostly academic.

Thanks,
Taylor

Taylor Blau Dec. 4, 2020, 7:11 p.m. UTC | #2

On Fri, Dec 04, 2020 at 02:05:12PM -0500, Taylor Blau wrote:
> On Fri, Dec 04, 2020 at 01:53:23PM -0500, Jeff King wrote:
> > I also considered adding a full iterator type with init/next/end
> > functions (similar to what we have for hashmaps). But it ended up making
> > the callers much harder to read. This version keeps us close to a basic
> > for-loop.
>
> Yeah, I think that a full-blown iterator type is overkill for this
> purpose. Another possible approach could be a macro:
>
>   #define for_each_oid_array_unique(arr, i) \
>     for (i = 0; (i) < (array)->nr; i = oid_array_for_each_unique((arr), i))
>
> but I don't think that's making anything more clear than
> 'oid_array_for_each_unique' already is. So, I like the approach that you
> took here.

Hmm. I take part of that back; it would simplify the caller in the eighth
patch, which first wants to sort the list before iterating it (which is
necessary, because we look at 'i = 0' before calling
oid_array_for_each_unique()).

So this could look instead like:

#define for_each_oid_array_unique(arr, i) \
    for (oid_array_sort(), i = 0; (i) < (array)->nr; i = oid_array_for_each_unique((arr), i))

Maybe a little too magical, who knows.

Eric Sunshine Dec. 4, 2020, 7:18 p.m. UTC | #3

On Fri, Dec 4, 2020 at 1:54 PM Jeff King <peff@peff.net> wrote:
> [...]
> The caller does have to remember to sort the array first. We could add
> an assertion into the helper that array->sorted is set, but I didn't
> want to complicate what is otherwise a pretty fast code path.
> [...]
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> diff --git a/oid-array.h b/oid-array.h
> @@ -111,4 +113,24 @@ void oid_array_filter(struct oid_array *array,
> +/**
> + * Find the next unique oid in the array after position "cur". You
> + * can use this to iterate over unique elements, like:
> + *
> + *   size_t i;
> + *   oid_array_sort(array);
> + *   for (i = 0; i < array->nr; i = oid_array_next_unique(array, i))
> + *     printf("%s", oid_to_hex(array->oids[i]);
> + *
> + * Non-unique iteration can just increment with "i++" to visit each element.
> + */

Minor: I see that the example code sorts the array first -- which is
necessary, as explained in the commit message -- but I wonder if it is
worth calling out explicitly in the prose:

    Find the next unique oid in the array after position `cur`.
    The array must be sorted for this to work. You can use
    this to iterate over unique elements like this:

> +static inline size_t oid_array_next_unique(struct oid_array *array, size_t cur)
> +{
> +       do {
> +               cur++;
> +       } while (cur < array->nr &&
> +                oideq(array->oid + cur, array->oid + cur - 1));
> +       return cur;
> +}

Jeff King Dec. 4, 2020, 7:51 p.m. UTC | #4

On Fri, Dec 04, 2020 at 02:05:12PM -0500, Taylor Blau wrote:

> > +static inline size_t oid_array_next_unique(struct oid_array *array, size_t cur)
> > +{
> > +	do {
> > +		cur++;
> > +	} while (cur < array->nr &&
> > +		 oideq(array->oid + cur, array->oid + cur - 1));
> 
> I don't love the pointer math here (would instead prefer
> oideq(&array->oid[cur]) and so on), but I don't think that it matters
> enough to make a difference.

I think it is a matter of preference. I actually prefer the arithmetic
(it's also what was in the code that we are replacing in
oid_array_for_each_unique(), but that is probably because I wrote that
one, too).

> I additionally had to make sure that cur - 1 >= 0 so that the second
> argument would always be valid, but it is, since we call cur++.

Yeah. I originally wrote it as:

  cur++;
  while (cur < array->nr) {
	if (!oideq(...))
		break;
	cur++;
  }

which maybe makes it more obvious that cur always gets implemented at
least once, but I found it harder to reason about the second increment.
I think the do-while expresses it pretty clearly, but we're not that
used to seeing them.

> You could check that cur++ doesn't overflow, but I think that that's
> mostly academic.

It can't overflow, because it can't ever go past array->nr (unless you
call it again after cur is already equal to array->nr, but that would be
a bug just like calling i++ after hitting the loop end would be).

-Peff

Jeff King Dec. 4, 2020, 7:52 p.m. UTC | #5

On Fri, Dec 04, 2020 at 02:11:51PM -0500, Taylor Blau wrote:

> > Yeah, I think that a full-blown iterator type is overkill for this
> > purpose. Another possible approach could be a macro:
> >
> >   #define for_each_oid_array_unique(arr, i) \
> >     for (i = 0; (i) < (array)->nr; i = oid_array_for_each_unique((arr), i))
> >
> > but I don't think that's making anything more clear than
> > 'oid_array_for_each_unique' already is. So, I like the approach that you
> > took here.
> 
> Hmm. I take part of that back; it would simplify the caller in the eighth
> patch, which first wants to sort the list before iterating it (which is
> necessary, because we look at 'i = 0' before calling
> oid_array_for_each_unique()).
> 
> So this could look instead like:
> 
> #define for_each_oid_array_unique(arr, i) \
>     for (oid_array_sort(), i = 0; (i) < (array)->nr; i = oid_array_for_each_unique((arr), i))
> 
> Maybe a little too magical, who knows.

That was one of the many variants I wrote on the way here. But yeah, I
think it is better to avoid magic macros unless they are getting us a
lot. In the case of hashmap, there is a lot of magic needed to convert
the hashmap_entry field back into the parent struct pointer. But here,
we can get away with a pretty natural looking for-loop, so I think it's
worth keeping it simple.

-Peff

Jeff King Dec. 4, 2020, 8:44 p.m. UTC | #6

On Fri, Dec 04, 2020 at 02:18:45PM -0500, Eric Sunshine wrote:

> > + * Find the next unique oid in the array after position "cur". You
> > + * can use this to iterate over unique elements, like:
> > + *
> > + *   size_t i;
> > + *   oid_array_sort(array);
> > + *   for (i = 0; i < array->nr; i = oid_array_next_unique(array, i))
> > + *     printf("%s", oid_to_hex(array->oids[i]);
> > + *
> > + * Non-unique iteration can just increment with "i++" to visit each element.
> > + */
> 
> Minor: I see that the example code sorts the array first -- which is
> necessary, as explained in the commit message -- but I wonder if it is
> worth calling out explicitly in the prose:
> 
>     Find the next unique oid in the array after position `cur`.
>     The array must be sorted for this to work. You can use
>     this to iterate over unique elements like this:

Thanks, that makes sense; I picked up your wording here.

-Peff

Eric Sunshine Dec. 4, 2020, 8:57 p.m. UTC | #7

On Fri, Dec 4, 2020 at 3:45 PM Jeff King <peff@peff.net> wrote:
> On Fri, Dec 04, 2020 at 02:18:45PM -0500, Eric Sunshine wrote:
> > Minor: I see that the example code sorts the array first -- which is
> > necessary, as explained in the commit message -- but I wonder if it is
> > worth calling out explicitly in the prose:
> >
> >     Find the next unique oid in the array after position `cur`.
> >     The array must be sorted for this to work. You can use
> >     this to iterate over unique elements like this:
>
> Thanks, that makes sense; I picked up your wording here.

It's probably not worth a re-roll just to update this comment, but if
you're re-rolling anyhow, the wording could be improved slightly:

    Find the next unique oid in the array after position `cur`.
    The array must be sorted for this to work. You can iterate
    over unique elements like this:

Junio C Hamano Dec. 4, 2020, 9:54 p.m. UTC | #8

Eric Sunshine <sunshine@sunshineco.com> writes:

> On Fri, Dec 4, 2020 at 1:54 PM Jeff King <peff@peff.net> wrote:
>> [...]
>> The caller does have to remember to sort the array first. We could add
>> an assertion into the helper that array->sorted is set, but I didn't
>> want to complicate what is otherwise a pretty fast code path.
>> [...]
>> Signed-off-by: Jeff King <peff@peff.net>
>> ---
>> diff --git a/oid-array.h b/oid-array.h
>> @@ -111,4 +113,24 @@ void oid_array_filter(struct oid_array *array,
>> +/**
>> + * Find the next unique oid in the array after position "cur". You
>> + * can use this to iterate over unique elements, like:
>> + *
>> + *   size_t i;
>> + *   oid_array_sort(array);
>> + *   for (i = 0; i < array->nr; i = oid_array_next_unique(array, i))
>> + *     printf("%s", oid_to_hex(array->oids[i]);
>> + *
>> + * Non-unique iteration can just increment with "i++" to visit each element.
>> + */
>
> Minor: I see that the example code sorts the array first -- which is
> necessary, as explained in the commit message -- but I wonder if it is
> worth calling out explicitly in the prose:
>
>     Find the next unique oid in the array after position `cur`.
>     The array must be sorted for this to work. You can use
>     this to iterate over unique elements like this:
>
>> +static inline size_t oid_array_next_unique(struct oid_array *array, size_t cur)

Perhaps the function can make it clear that it expects to be fed a
sorted array in its name, which would be even better?

>> +{
>> +       do {
>> +               cur++;
>> +       } while (cur < array->nr &&
>> +                oideq(array->oid + cur, array->oid + cur - 1));
>> +       return cur;
>> +}

Jeff King Dec. 7, 2020, 7:05 p.m. UTC | #9

On Fri, Dec 04, 2020 at 01:54:31PM -0800, Junio C Hamano wrote:

> >> +static inline size_t oid_array_next_unique(struct oid_array *array, size_t cur)
> 
> Perhaps the function can make it clear that it expects to be fed a
> sorted array in its name, which would be even better?

The name is already pretty clunky. I'm not thrilled with the idea of
making it even longer. :)

IMHO it's unlikely for somebody to stumble on it and not read the
comment or look at an existing call, if only because the function you'd
probably reach for immediately is more like oid_array_for_each_unique().

-Peff

[6/9] oid-array: provide a for-loop iterator

Commit Message

Comments

Patch