[00/20] pack-revindex: prepare for on-disk reverse index

Message ID	cover.1610129796.git.me@ttaylorr.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Date: Fri, 8 Jan 2021 13:16:39 -0500 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: peff@peff.net, jrnieder@gmail.com Subject: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Message-ID: <cover.1610129796.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Precedence: bulk
Series	pack-revindex: prepare for on-disk reverse index \| expand [00/20] pack-revindex: prepare for on-disk reverse index [01/20] pack-revindex: introduce a new API [02/20] write_reuse_object(): convert to new revindex API [03/20] write_reused_pack_one(): convert to new revindex API [04/20] write_reused_pack_verbatim(): convert to new revindex API [05/20] check_object(): convert to new revindex API [06/20] bitmap_position_packfile(): convert to new revindex API [07/20] show_objects_for_type(): convert to new revindex API [08/20] get_size_by_pos(): convert to new revindex API [09/20] try_partial_reuse(): convert to new revindex API [10/20] rebuild_existing_bitmaps(): convert to new revindex API [11/20] get_delta_base_oid(): convert to new revindex API [12/20] retry_bad_packed_offset(): convert to new revindex API [13/20] packed_object_info(): convert to new revindex API [14/20] unpack_entry(): convert to new revindex API [15/20] for_each_object_in_pack(): convert to new revindex API [16/20] builtin/gc.c: guess the size of the revindex [17/20] pack-revindex: remove unused 'find_pack_revindex()' [18/20] pack-revindex: remove unused 'find_revindex_position()' [19/20] pack-revindex: hide the definition of 'revindex_entry' [20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'

Taylor Blau Jan. 8, 2021, 6:16 p.m. UTC

Hi,

This is the first of two series to introduce an on-disk alternative to the
reverse index which is currently generated once per-process and stored in
memory.

As a reminder, the reverse index is used to translate an object's position
relative to other objects within the same packfile to that object's position
relative to the pack's index.

Generating the reverse index in memory for repositories with large packs has two
significant drawbacks:

  - It requires allocating sizeof(struct revindex_entry) per packed object.

  - It requires us to sort the entries by their pack offset. This is implemented
    in sort_revindex() using a radix sort, but still takes considerable time (as
    benchmarks found in the second series demonstrate).

Both of these can be addressed by storing the reverse index in a new '.rev' file
alongside the packs. This file is written once (during pack creation), and does
not require sorting when accessed, since it is stored in a sorted order.

This series lays the groundwork necessary to introduce the file format (which is
implemented in the second series). I'm splitting these two up since I think the
changes in this series can be queued independently from those in the latter
series.

The goal of this series is to remove direct access of the `struct
revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The
on-disk format will be mmap'd and accessed directly, but the format is
sufficiently different that the whole `revindex` array can't be written as-is.

Separately from our main goal, the new API that is introduced in this series is
IMHO easier to read, and more clearly describes what order the given and
returned values are relative to.

The changes are structured as follows:

  - First, a new API is proposed.

  - Then, uses of the old API are removed one by one and replaced with their
    new counterparts.

  - Finally, without any callers remaining, the old API is removed.

Thanks in advance for your review.

Taylor Blau (20):
  pack-revindex: introduce a new API
  write_reuse_object(): convert to new revindex API
  write_reused_pack_one(): convert to new revindex API
  write_reused_pack_verbatim(): convert to new revindex API
  check_object(): convert to new revindex API
  bitmap_position_packfile(): convert to new revindex API
  show_objects_for_type(): convert to new revindex API
  get_size_by_pos(): convert to new revindex API
  try_partial_reuse(): convert to new revindex API
  rebuild_existing_bitmaps(): convert to new revindex API
  get_delta_base_oid(): convert to new revindex API
  retry_bad_packed_offset(): convert to new revindex API
  packed_object_info(): convert to new revindex API
  unpack_entry(): convert to new revindex API
  for_each_object_in_pack(): convert to new revindex API
  builtin/gc.c: guess the size of the revindex
  pack-revindex: remove unused 'find_pack_revindex()'
  pack-revindex: remove unused 'find_revindex_position()'
  pack-revindex: hide the definition of 'revindex_entry'
  pack-revindex.c: avoid direct revindex access in
    'offset_to_pack_pos()'

 builtin/gc.c           |  2 +-
 builtin/pack-objects.c | 33 +++++++++++++++++-----------
 pack-bitmap.c          | 44 ++++++++++++++++++-------------------
 pack-revindex.c        | 49 +++++++++++++++++++++++++++--------------
 pack-revindex.h        | 10 +++------
 packfile.c             | 50 ++++++++++++++++++++++++++----------------
 6 files changed, 109 insertions(+), 79 deletions(-)

Derrick Stolee Jan. 11, 2021, 12:07 p.m. UTC | #1

On 1/8/2021 1:16 PM, Taylor Blau wrote:
...
>   - First, a new API is proposed.
> 
>   - Then, uses of the old API are removed one by one and replaced with their
>     new counterparts.
> 
>   - Finally, without any callers remaining, the old API is removed.

This patch series has a clear layout that was easy to follow.

In a vacuum, the conversions from immediate struct member lookups
to API calls seems like adding overhead to something that is done
frequently in a loop. However, you do justify it:

> Generating the reverse index in memory for repositories with large packs has two
> significant drawbacks:
> 
>   - It requires allocating sizeof(struct revindex_entry) per packed object.
> 
>   - It requires us to sort the entries by their pack offset. This is implemented
>     in sort_revindex() using a radix sort, but still takes considerable time (as
>     benchmarks found in the second series demonstrate).
> 
> Both of these can be addressed by storing the reverse index in a new '.rev' file
> alongside the packs. This file is written once (during pack creation), and does
> not require sorting when accessed, since it is stored in a sorted order.

Even if these method calls do add a bit of overhead to each
access, it helps to not compute the table from scratch before
any access is possible.

This will be particularly valuable for operations that use only
a few position lookups, such as "is object A reachable from
commit C?"

Operations that iterate through every object in a bitmap are
more likely to notice a difference, but that will probably be
visible in the next series.

My comments on this series are very minor.

I made only one comment about "if (method() < 0)" versus
"if (method())" but that pattern appears in multiple patches.
_If_ you decide to change that pattern, then I'm sure you can
find all uses.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thanks,
-Stolee

Taylor Blau Jan. 11, 2021, 4:30 p.m. UTC | #2

On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote:
> My comments on this series are very minor.
>
> I made only one comment about "if (method() < 0)" versus
> "if (method())" but that pattern appears in multiple patches.
> _If_ you decide to change that pattern, then I'm sure you can
> find all uses.

I have no strong opinion here, so I'm happy to defer to your or others'
judgement. My very weak opinion is that I'd just as soon leave it as-is,
but that if I'm rerolling and others would like to see it changed, then
I'm happy to do it.

> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thank you for your review. I think that I owe you some as well,
somewhere in the near-2,000 emails that I still have :-/.

Thanks,
Taylor

Derrick Stolee Jan. 11, 2021, 5:15 p.m. UTC | #3

On 1/11/2021 11:30 AM, Taylor Blau wrote:
> On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote:
>> My comments on this series are very minor.
>>
>> I made only one comment about "if (method() < 0)" versus
>> "if (method())" but that pattern appears in multiple patches.
>> _If_ you decide to change that pattern, then I'm sure you can
>> find all uses.
> 
> I have no strong opinion here, so I'm happy to defer to your or others'
> judgement. My very weak opinion is that I'd just as soon leave it as-is,
> but that if I'm rerolling and others would like to see it changed, then
> I'm happy to do it.

Well, I found 782 instances of ") < 0)" in the codebase, and my initial
scan of these shows they are doing exactly what you are asking. So as
far as code style goes, there is plenty of precedent.

The thing that makes me react to this is that it _looks_ like an extra
comparison. However, I'm sure the assembly instructions have the same
performance characteristics between "!= 0" and "< 0".

Thanks,
-Stolee

Taylor Blau Jan. 11, 2021, 5:29 p.m. UTC | #4

On Mon, Jan 11, 2021 at 12:15:58PM -0500, Derrick Stolee wrote:
> On 1/11/2021 11:30 AM, Taylor Blau wrote:
> > On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote:
> >> My comments on this series are very minor.
> >>
> >> I made only one comment about "if (method() < 0)" versus
> >> "if (method())" but that pattern appears in multiple patches.
> >> _If_ you decide to change that pattern, then I'm sure you can
> >> find all uses.
> >
> > I have no strong opinion here, so I'm happy to defer to your or others'
> > judgement. My very weak opinion is that I'd just as soon leave it as-is,
> > but that if I'm rerolling and others would like to see it changed, then
> > I'm happy to do it.
>
> Well, I found 782 instances of ") < 0)" in the codebase, and my initial
> scan of these shows they are doing exactly what you are asking. So as
> far as code style goes, there is plenty of precedent.

Thanks for looking, I was curious about that myself after our thread,
but I hadn't yet bothered to look.

> The thing that makes me react to this is that it _looks_ like an extra
> comparison. However, I'm sure the assembly instructions have the same
> performance characteristics between "!= 0" and "< 0".

It should make no difference. Both comparisons will do a 'cmp $0 ...'
where '...' is probably an indirect into the current frame. The '!= 0'
will use je, and the '< 0' comparison will use 'jns'. Both conditional
jumps should be implemented by checking a CPU flag only (ZF and SF,
respectively).

Not that any of this matters, it's just fun to look.

> Thanks,
> -Stolee
>
Thanks,
Taylor

Junio C Hamano Jan. 11, 2021, 6:40 p.m. UTC | #5

Derrick Stolee <stolee@gmail.com> writes:

> Well, I found 782 instances of ") < 0)" in the codebase, and my initial
> scan of these shows they are doing exactly what you are asking. So as
> far as code style goes, there is plenty of precedent.
>
> The thing that makes me react to this is that it _looks_ like an extra
> comparison. However, I'm sure the assembly instructions have the same
> performance characteristics between "!= 0" and "< 0".

I'd prefer to keep it that way for the human cost's point of view.

Perhaps it could be subjective but

	if (func_that_signals_error_with_return_value() < 0)

is immediately recognizable as checking for an error to folks who
were trained to write C in POSIX environment, as "on error, return
negative" is a convention that they are familiar with.  At least to
me, your "if (func_that_signals_error_with_return_value())" looks
unnatural and makes me look at the function to see what its return
value means.

If there are helper functions that use "non-zero is an error and
zero is success" convention, we should look at them to see why they
do not do the usual "a negative is an error and a non-negative is
success".  And if the *only* reason they do so is because their
normal return do not have to give more than one kind of "success",
we should see if we can fix them to follow the usual "a negative is
an error" convention, I would think.

Thanks.

Jeff King Jan. 12, 2021, 9:45 a.m. UTC | #6

On Fri, Jan 08, 2021 at 01:16:39PM -0500, Taylor Blau wrote:

> Generating the reverse index in memory for repositories with large packs has two
> significant drawbacks:
> 
>   - It requires allocating sizeof(struct revindex_entry) per packed object.
> 
>   - It requires us to sort the entries by their pack offset. This is implemented
>     in sort_revindex() using a radix sort, but still takes considerable time (as
>     benchmarks found in the second series demonstrate).

Or thinking about it more fundamentally: any operation which touches the
revindex is now O(nr_objects_in_repo), even if it only cares about a few
objects. Ideally this will eventually be this O(log nr_objects_in_repo);
we can't do much better than that because of object lookups (unless we
replace the .idx with a perfect hash or something).

> The goal of this series is to remove direct access of the `struct
> revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The
> on-disk format will be mmap'd and accessed directly, but the format is
> sufficiently different that the whole `revindex` array can't be written as-is.

It looks good overall to me. I left a few nits around documentation and
integer types that I think are worth a re-roll, but I think after
addressing those it should be good.

-Peff

Taylor Blau Jan. 12, 2021, 5:07 p.m. UTC | #7

On Tue, Jan 12, 2021 at 04:45:47AM -0500, Jeff King wrote:
> > The goal of this series is to remove direct access of the `struct
> > revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The
> > on-disk format will be mmap'd and accessed directly, but the format is
> > sufficiently different that the whole `revindex` array can't be written as-is.
>
> It looks good overall to me. I left a few nits around documentation and
> integer types that I think are worth a re-roll, but I think after
> addressing those it should be good.

Thanks so much for your review. I think I addressed all of your
feedback, but I'll sit on the revised patches for another day or so in
case other reviewers would like to chime in before v2.

Hopefully the this re-roll is the only one we'll need, since the next
series is much more interesting!

Thanks,
Taylor

Junio C Hamano Jan. 13, 2021, 12:23 a.m. UTC | #8

Taylor Blau <me@ttaylorr.com> writes:

> This is the first of two series to introduce an on-disk alternative to the
> reverse index which is currently generated once per-process and stored in
> memory.

Queued; seems to be killed on Windows but otherwise looking good.

  https://github.com/git/git/runs/1691849288?check_suite_focus=true#step:6:164

Thanks.

Taylor Blau Jan. 13, 2021, 12:52 a.m. UTC | #9

On Tue, Jan 12, 2021 at 04:23:00PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This is the first of two series to introduce an on-disk alternative to the
> > reverse index which is currently generated once per-process and stored in
> > memory.
>
> Queued; seems to be killed on Windows but otherwise looking good.

Thanks.

>   https://github.com/git/git/runs/1691849288?check_suite_focus=true#step:6:164

I will look into those failures. It looks like these are all due to
running 'git index-pack' when the '*.idx' file already exists (resulting
in Windows being unable to write over the file).

Did you want to queue these two topics separately? It looks like you
picked them up in 'seen' as one big topic in a8f367dae2 (Merge branch
'tb/pack-revindex-api' into jch, 2021-01-12), but I thought you may want
to pick parts one and two up separately to allow them to progress
independently of one another.

In either case, I am planning on sending a new version of the first
series tomorrow to address Peff's feedback.

> Thanks.

Thanks,
Taylor

Junio C Hamano Jan. 13, 2021, 2:15 a.m. UTC | #10

Taylor Blau <me@ttaylorr.com> writes:

> Did you want to queue these two topics separately?

Unless it is a reasonable amount of certainty that the bottom topic
will not have to be rerolled, I'd rather not to have separate topics
on top of each other.

It is tedious and error prone, having to rebase the old iteration of
the top topic on top when a new iteration of the bottom topic comes
out.  I'd rather see that the top one get rerolled whenever the
bottom one gets rerolled to make life simpler.

Even in a single topic, I would encourage people to put the
foundational and preparatory work early so that we can make them
solidify before review really gets to later parts of the series.

And when a series is structured like so, it is perfectly fine to say
something like:

    Here is a new iteration of the last 7 patches---the early 13
    patches are the same as the previous round, so reset the topic
    to bb6414ab (packfile: prepare for the existence of '*.rev'
    files, 2021-01-08) and then apply these 7.

if you do not want to send all patches in a nontrivial series when
it gets updated.

Thanks.

Taylor Blau Jan. 13, 2021, 3:23 a.m. UTC | #11

On Tue, Jan 12, 2021 at 06:15:11PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Did you want to queue these two topics separately?
>
> Unless it is a reasonable amount of certainty that the bottom topic
> will not have to be rerolled, I'd rather not to have separate topics
> on top of each other.

Hmm. I certainly do not want to create any tedious work for you, but I
do think that we're in a situation where the bottom topic is stable and
the interesting discussion will happen in the newer topic.

> It is tedious and error prone, having to rebase the old iteration of
> the top topic on top when a new iteration of the bottom topic comes
> out.  I'd rather see that the top one get rerolled whenever the
> bottom one gets rerolled to make life simpler.

Indeed, my local v2 of the bottom topic requires the top topic to be
rebased onto it, and so I'll plan to send a v2 of each tomorrow morning.

> Even in a single topic, I would encourage people to put thet
> foundational and preparatory work early so that we can make them
> solidify before review really gets to later parts of the series.
>
> And when a series is structured like so, it is perfectly fine to say
> something like:
>
>     Here is a new iteration of the last 7 patches---the early 13
>     patches are the same as the previous round, so reset the topic
>     to bb6414ab (packfile: prepare for the existence of '*.rev'
>     files, 2021-01-08) and then apply these 7.
>
> if you do not want to send all patches in a nontrivial series when
> it gets updated.

I am happy to do it this way if you would prefer. If so, you may want to
rename this topic while queuing, since it is not just about a new
revindex API, but rather about introducing the on-disk format. (I would
suggest tb/on-disk-revindex, if you were looking for alternate names).

If you agree that the bottom topic is stable, I'd prefer to send the top
one separately. Otherwise, I can send both together. Let me know.

> Thanks.

Thanks,
Taylor

Junio C Hamano Jan. 13, 2021, 8:21 a.m. UTC | #12

Taylor Blau <me@ttaylorr.com> writes:

> If you agree that the bottom topic is stable, I'd prefer to send the top
> one separately. Otherwise, I can send both together. Let me know.

I do not expect the first 20 of the 20+8 patches to be stable from
the beginning---in fact, after reading 01/20 myself, and seeing a
few of Peff's reviews, I expect that you'll be redoing at least some
of them.

When we deal with this kind of situation (not limited to these
topics), let's make it a tradition to first pretend that we have a
long single topic, and expect that the early rounds of review focus
on two things:

 (1) to identify the best ordering of the commits (topics from
     experienced contributors like you are likely to be already
     structured in a good order, so reviewers may only have to say
     "the ordering looks good", but sometimes they may want to say
     thinks like "it would be better to leave patches 5, 8 and 11 to
     much later in the series".

 (2) to find good points to divide the series into two (or more)
     pieces, and spend more effort on helping the bottom part to
     solidify faster.

That way, the bottom part can be merged sooner to 'next' than the
rest.  It always is cumbersome to have some part of the series in
'next' and remainder in 'seen', so at that point, the lower half
would naturally gain a different name before it gets merged to
'next', I would think.

Thanks.

Jeff King Jan. 13, 2021, 1:13 p.m. UTC | #13

On Wed, Jan 13, 2021 at 12:21:12AM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > If you agree that the bottom topic is stable, I'd prefer to send the top
> > one separately. Otherwise, I can send both together. Let me know.
> 
> I do not expect the first 20 of the 20+8 patches to be stable from
> the beginning---in fact, after reading 01/20 myself, and seeing a
> few of Peff's reviews, I expect that you'll be redoing at least some
> of them.

They'll definitely need at least one re-roll. But I think Taylor is
expecting (and I do too) that the second half will probably have a lot
more back-and-forth over the on-disk format, and hence need more
re-rolls.

My main concern is reviewer fatigue. 28 patches is a lot. If we can
solidify the first 20 and then let people focus on the final 8
separately, that helps. If you're OK with splitting a topic and saying
"this is a re-roll of just the last 8 patches", then that problem is
solved. But IMHO it is easier to just point out that split from the
start than it is to come up with it after the fact. It tells reviewers
what to expect from the get-go.

>  (2) to find good points to divide the series into two (or more)
>      pieces, and spend more effort on helping the bottom part to
>      solidify faster.

I think we just did that preemptively. ;) In these two particular
series, the first 20 (or at least the first 19) are an improvement by
themselves. I think they would be worth doing even without the final 8,
both to make calling code more readable and to add new error checks and
assertions to revindex access.

> That way, the bottom part can be merged sooner to 'next' than the
> rest.  It always is cumbersome to have some part of the series in
> 'next' and remainder in 'seen', so at that point, the lower half
> would naturally gain a different name before it gets merged to
> 'next', I would think.

That seems to me like it ends up being _more_ work than just making them
into two branches in the first place.

So I guess I remain skeptical that ad-hoc splitting of longer series is
easier than doing so up front. But you're the one who does all of the
branch shuffling in the end, so if you really prefer longer series, I'm
not what I think matters that much. ;)

-Peff

Taylor Blau Jan. 13, 2021, 3:34 p.m. UTC | #14

On Wed, Jan 13, 2021 at 08:13:57AM -0500, Jeff King wrote:
> On Wed, Jan 13, 2021 at 12:21:12AM -0800, Junio C Hamano wrote:
>
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > If you agree that the bottom topic is stable, I'd prefer to send the top
> > > one separately. Otherwise, I can send both together. Let me know.
> >
> > I do not expect the first 20 of the 20+8 patches to be stable from
> > the beginning---in fact, after reading 01/20 myself, and seeing a
> > few of Peff's reviews, I expect that you'll be redoing at least some
> > of them.
>
> They'll definitely need at least one re-roll. But I think Taylor is
> expecting (and I do too) that the second half will probably have a lot
> more back-and-forth over the on-disk format, and hence need more
> re-rolls.

Indeed, I find the first ~20 patches fairly benign, and I think that the
interesting discussion will and should take place over the final 8
patches.

For what it's worth, I was referring to the pending re-roll I have of
the first 20 patches as the stable one. (Peff notes in [1] that he
thought there wasn't much else to consider beyond his comments.)

> My main concern is reviewer fatigue. 28 patches is a lot. If we can
> solidify the first 20 and then let people focus on the final 8
> separately, that helps. If you're OK with splitting a topic and saying
> "this is a re-roll of just the last 8 patches", then that problem is
> solved. But IMHO it is easier to just point out that split from the
> start than it is to come up with it after the fact. It tells reviewers
> what to expect from the get-go.

Yes, exactly.

> > That way, the bottom part can be merged sooner to 'next' than the
> > rest.  It always is cumbersome to have some part of the series in
> > 'next' and remainder in 'seen', so at that point, the lower half
> > would naturally gain a different name before it gets merged to
> > 'next', I would think.
>
> That seems to me like it ends up being _more_ work than just making them
> into two branches in the first place.

I agree, but I also wasn't aware that you would consider queuing part of
a series. If that's the route you want to take, I'm OK with that. But I
tend to agree with Peff that (in this case since a clear deliniation
already exists) it may save us time to just send two separate series
from the get-go.

> So I guess I remain skeptical that ad-hoc splitting of longer series is
> easier than doing so up front. But you're the one who does all of the
> branch shuffling in the end, so if you really prefer longer series, I'm
> not what I think matters that much. ;)

Agreed. Junio, let me know which you'd prefer in this case (I'm not sure
if the additional context has changed your mind or not).

> -Peff

Thanks,
Taylor

[1]: https://lore.kernel.org/git/X%2F1vy3D10wDEZNva@coredump.intra.peff.net/

Junio C Hamano Jan. 13, 2021, 8:06 p.m. UTC | #15

Taylor Blau <me@ttaylorr.com> writes:

>> > That way, the bottom part can be merged sooner to 'next' than the
>> > rest.  It always is cumbersome to have some part of the series in
>> > 'next' and remainder in 'seen', so at that point, the lower half
>> > would naturally gain a different name before it gets merged to
>> > 'next', I would think.
>>
>> That seems to me like it ends up being _more_ work than just making them
>> into two branches in the first place.

More work to contributors?  How?

As long as each of 20-patch and 8-patch series is marked clearly to
manage the risk of mistakes and confusion down to the same level as
a single long series, I am perfectly OK.

Examples of help contributors could have made, which would have
avoided past confusion (these are not "potential" ones, but I had to
redo day's intergration in the past because of one long topic
building on top of another) are:

 - When sending either topic, not limited to the initial round but
   in all the subsequent rounds, remind that the top topic is to be
   applied on top of the bottom topic.

 - When updating the bottom topic (e.g. 20-patch one in this case),
   send out the top one (e.g. 8-patch one), too (or instruct me to
   discard the top one tentatively).

The worst case that happened in the past was that a quite minor
tweak was made to a bottom topic that was depended on another topic,
so I just queued the new iteration of the bottom topic again,
without realizing that the other one needed to be rebased.  We ended
up two copies of the bottom topic commits in 'pu' (these days we
call it 'seen') as the tweak was so minor that the two topics
cleanly merged into 'pu' without causing conflict.  The next bad
case was a similar situation with larger rewrite of the bottom
topic, which caused me to look at quite a big conflict and waste my
time until I realized that I was missing an updated top half.

If the inter-dependent topics that caused me trouble were managed as
a single long patch series, either with "this is a full replacement
of the new iteration" or "these are to update only the last 8
patches; apply them after rewinding the topic to commit f0e1d2c3
(gostak: distim doshes, 2021-01-08)", would have had a lot less risk
to introduce human error on this end.

> I agree, but I also wasn't aware that you would consider queuing part of
> a series. If that's the route you want to take, I'm OK with that.

Discarding broken part of a series and only queuing a good part can
happen with or without multiple topics.  Merging one topic to 'next'
but not the other also happens.  Merging early part of a topic to
'next' while leaving the rest to 'seen' is possible but I'd prefer
to avoid it.  Because of the last one, a single long topic, when a
bottom part stabilizes enough, would likely to gain a separate name
and its tip would be merged to 'next'.

> But I
> tend to agree with Peff that (in this case since a clear deliniation
> already exists) it may save us time to just send two separate series
> from the get-go.

As long as the two serieses are marked as such clearly, not just in
the initial round but in all subsequent rounds, it is OK.  But in an
unproven initial round, you may regret having to move a patch across
topics, from the bottom one to the top one or vice versa, instead of
just reordering inside a single topic.

>> So I guess I remain skeptical that ad-hoc splitting of longer series is
>> easier than doing so up front.

Nobody suggested ad-hoc splitting.  I was saying that splitting
would naturally grow out of reviews toward stabilization.

Taylor Blau Jan. 13, 2021, 8:13 p.m. UTC | #16

On Wed, Jan 13, 2021 at 12:06:08PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > But I tend to agree with Peff that (in this case since a clear
> > deliniation already exists) it may save us time to just send two
> > separate series from the get-go.
>
> As long as the two serieses are marked as such clearly, not just in
> the initial round but in all subsequent rounds, it is OK.  But in an
> unproven initial round, you may regret having to move a patch across
> topics, from the bottom one to the top one or vice versa, instead of
> just reordering inside a single topic.

Sounds good. Like I said, the last thing I want to do is create undue
burden on your or anybody else when queuing or reviewing these topics.

So, I'll try to err on the side of stating the dependency between topics
too often rather than not often enough. Hopefully the first ~20 patches
are boring enough that they will make their way to master soon enough
and we don't have to worry about it.

Ordinarily, I would have held off on the second series until more the
first one had graduated, but I felt that the pure refactorings didn't
make much sense on their own without the new file-based backend to
motivate them.

Thanks,
Taylor

Jeff King Jan. 13, 2021, 8:22 p.m. UTC | #17

On Wed, Jan 13, 2021 at 12:06:08PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> >> > That way, the bottom part can be merged sooner to 'next' than the
> >> > rest.  It always is cumbersome to have some part of the series in
> >> > 'next' and remainder in 'seen', so at that point, the lower half
> >> > would naturally gain a different name before it gets merged to
> >> > 'next', I would think.
> >>
> >> That seems to me like it ends up being _more_ work than just making them
> >> into two branches in the first place.
> 
> More work to contributors?  How?

The quoted part is from me, so I'll respond: I didn't mean contributors,
but it seems like more work to you. I.e., you are ending up with the
same multi-branch config _and_ you have to split the branches yourself
later after seeing review.

But reading what you wrote below, the advantage is that if this does not
happen until the first part hits "next", then there is no chance of it
being rebased at that point (and thus getting rewritten out from under
the second topic).

> The worst case that happened in the past was that a quite minor
> tweak was made to a bottom topic that was depended on another topic,
> so I just queued the new iteration of the bottom topic again,
> without realizing that the other one needed to be rebased.  We ended
> up two copies of the bottom topic commits in 'pu' (these days we
> call it 'seen') as the tweak was so minor that the two topics
> cleanly merged into 'pu' without causing conflict.  The next bad
> case was a similar situation with larger rewrite of the bottom
> topic, which caused me to look at quite a big conflict and waste my
> time until I realized that I was missing an updated top half.

I somehow assumed you had more automation there. On my end, since I
rebase my topics aggressively, it is just a matter of pointing the
branch upstream in the right place. But of course that is not your
workflow at all.

I know you do have the "this branches uses" logic in your what's cooking
emails. In theory it could remind you of the situation, but I'm not sure
where in the workflow you'd insert it (by the time you run the WC
script, it is hard to realize the rebasing that _should_ have been done
earlier, unless you collate patch-ids, and even that is not 100%).

I do wonder if setting the dependent branch's @{upstream} would be
helpful here. You do not rebase all of your topics, but the ones with a
local-branch @{u} would be candidates for doing so.

All that said, I am also sensitive that my armchair "you could do it
like this..." suggesting may not be fully informed. So take it as idle
thoughts, not necessarily arguments. :)

> >> So I guess I remain skeptical that ad-hoc splitting of longer series is
> >> easier than doing so up front.
> 
> Nobody suggested ad-hoc splitting.  I was saying that splitting
> would naturally grow out of reviews toward stabilization.

This quote is me again. By "ad-hoc" I meant exactly this "after we see
some reviews" (as opposed to drawing a line up front).

-Peff

[00/20] pack-revindex: prepare for on-disk reverse index

Message

Comments