Message ID | cover.1610129796.git.me@ttaylorr.com (mailing list archive) |
---|---|
Headers | show |
Series | pack-revindex: prepare for on-disk reverse index | expand |
On 1/8/2021 1:16 PM, Taylor Blau wrote: ... > - First, a new API is proposed. > > - Then, uses of the old API are removed one by one and replaced with their > new counterparts. > > - Finally, without any callers remaining, the old API is removed. This patch series has a clear layout that was easy to follow. In a vacuum, the conversions from immediate struct member lookups to API calls seems like adding overhead to something that is done frequently in a loop. However, you do justify it: > Generating the reverse index in memory for repositories with large packs has two > significant drawbacks: > > - It requires allocating sizeof(struct revindex_entry) per packed object. > > - It requires us to sort the entries by their pack offset. This is implemented > in sort_revindex() using a radix sort, but still takes considerable time (as > benchmarks found in the second series demonstrate). > > Both of these can be addressed by storing the reverse index in a new '.rev' file > alongside the packs. This file is written once (during pack creation), and does > not require sorting when accessed, since it is stored in a sorted order. Even if these method calls do add a bit of overhead to each access, it helps to not compute the table from scratch before any access is possible. This will be particularly valuable for operations that use only a few position lookups, such as "is object A reachable from commit C?" Operations that iterate through every object in a bitmap are more likely to notice a difference, but that will probably be visible in the next series. My comments on this series are very minor. I made only one comment about "if (method() < 0)" versus "if (method())" but that pattern appears in multiple patches. _If_ you decide to change that pattern, then I'm sure you can find all uses. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Thanks, -Stolee
On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote: > My comments on this series are very minor. > > I made only one comment about "if (method() < 0)" versus > "if (method())" but that pattern appears in multiple patches. > _If_ you decide to change that pattern, then I'm sure you can > find all uses. I have no strong opinion here, so I'm happy to defer to your or others' judgement. My very weak opinion is that I'd just as soon leave it as-is, but that if I'm rerolling and others would like to see it changed, then I'm happy to do it. > Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Thank you for your review. I think that I owe you some as well, somewhere in the near-2,000 emails that I still have :-/. Thanks, Taylor
On 1/11/2021 11:30 AM, Taylor Blau wrote: > On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote: >> My comments on this series are very minor. >> >> I made only one comment about "if (method() < 0)" versus >> "if (method())" but that pattern appears in multiple patches. >> _If_ you decide to change that pattern, then I'm sure you can >> find all uses. > > I have no strong opinion here, so I'm happy to defer to your or others' > judgement. My very weak opinion is that I'd just as soon leave it as-is, > but that if I'm rerolling and others would like to see it changed, then > I'm happy to do it. Well, I found 782 instances of ") < 0)" in the codebase, and my initial scan of these shows they are doing exactly what you are asking. So as far as code style goes, there is plenty of precedent. The thing that makes me react to this is that it _looks_ like an extra comparison. However, I'm sure the assembly instructions have the same performance characteristics between "!= 0" and "< 0". Thanks, -Stolee
On Mon, Jan 11, 2021 at 12:15:58PM -0500, Derrick Stolee wrote: > On 1/11/2021 11:30 AM, Taylor Blau wrote: > > On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote: > >> My comments on this series are very minor. > >> > >> I made only one comment about "if (method() < 0)" versus > >> "if (method())" but that pattern appears in multiple patches. > >> _If_ you decide to change that pattern, then I'm sure you can > >> find all uses. > > > > I have no strong opinion here, so I'm happy to defer to your or others' > > judgement. My very weak opinion is that I'd just as soon leave it as-is, > > but that if I'm rerolling and others would like to see it changed, then > > I'm happy to do it. > > Well, I found 782 instances of ") < 0)" in the codebase, and my initial > scan of these shows they are doing exactly what you are asking. So as > far as code style goes, there is plenty of precedent. Thanks for looking, I was curious about that myself after our thread, but I hadn't yet bothered to look. > The thing that makes me react to this is that it _looks_ like an extra > comparison. However, I'm sure the assembly instructions have the same > performance characteristics between "!= 0" and "< 0". It should make no difference. Both comparisons will do a 'cmp $0 ...' where '...' is probably an indirect into the current frame. The '!= 0' will use je, and the '< 0' comparison will use 'jns'. Both conditional jumps should be implemented by checking a CPU flag only (ZF and SF, respectively). Not that any of this matters, it's just fun to look. > Thanks, > -Stolee > Thanks, Taylor
Derrick Stolee <stolee@gmail.com> writes: > Well, I found 782 instances of ") < 0)" in the codebase, and my initial > scan of these shows they are doing exactly what you are asking. So as > far as code style goes, there is plenty of precedent. > > The thing that makes me react to this is that it _looks_ like an extra > comparison. However, I'm sure the assembly instructions have the same > performance characteristics between "!= 0" and "< 0". I'd prefer to keep it that way for the human cost's point of view. Perhaps it could be subjective but if (func_that_signals_error_with_return_value() < 0) is immediately recognizable as checking for an error to folks who were trained to write C in POSIX environment, as "on error, return negative" is a convention that they are familiar with. At least to me, your "if (func_that_signals_error_with_return_value())" looks unnatural and makes me look at the function to see what its return value means. If there are helper functions that use "non-zero is an error and zero is success" convention, we should look at them to see why they do not do the usual "a negative is an error and a non-negative is success". And if the *only* reason they do so is because their normal return do not have to give more than one kind of "success", we should see if we can fix them to follow the usual "a negative is an error" convention, I would think. Thanks.
On Fri, Jan 08, 2021 at 01:16:39PM -0500, Taylor Blau wrote: > Generating the reverse index in memory for repositories with large packs has two > significant drawbacks: > > - It requires allocating sizeof(struct revindex_entry) per packed object. > > - It requires us to sort the entries by their pack offset. This is implemented > in sort_revindex() using a radix sort, but still takes considerable time (as > benchmarks found in the second series demonstrate). Or thinking about it more fundamentally: any operation which touches the revindex is now O(nr_objects_in_repo), even if it only cares about a few objects. Ideally this will eventually be this O(log nr_objects_in_repo); we can't do much better than that because of object lookups (unless we replace the .idx with a perfect hash or something). > The goal of this series is to remove direct access of the `struct > revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The > on-disk format will be mmap'd and accessed directly, but the format is > sufficiently different that the whole `revindex` array can't be written as-is. It looks good overall to me. I left a few nits around documentation and integer types that I think are worth a re-roll, but I think after addressing those it should be good. -Peff
On Tue, Jan 12, 2021 at 04:45:47AM -0500, Jeff King wrote: > > The goal of this series is to remove direct access of the `struct > > revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The > > on-disk format will be mmap'd and accessed directly, but the format is > > sufficiently different that the whole `revindex` array can't be written as-is. > > It looks good overall to me. I left a few nits around documentation and > integer types that I think are worth a re-roll, but I think after > addressing those it should be good. Thanks so much for your review. I think I addressed all of your feedback, but I'll sit on the revised patches for another day or so in case other reviewers would like to chime in before v2. Hopefully the this re-roll is the only one we'll need, since the next series is much more interesting! Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > This is the first of two series to introduce an on-disk alternative to the > reverse index which is currently generated once per-process and stored in > memory. Queued; seems to be killed on Windows but otherwise looking good. https://github.com/git/git/runs/1691849288?check_suite_focus=true#step:6:164 Thanks.
On Tue, Jan 12, 2021 at 04:23:00PM -0800, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > This is the first of two series to introduce an on-disk alternative to the > > reverse index which is currently generated once per-process and stored in > > memory. > > Queued; seems to be killed on Windows but otherwise looking good. Thanks. > https://github.com/git/git/runs/1691849288?check_suite_focus=true#step:6:164 I will look into those failures. It looks like these are all due to running 'git index-pack' when the '*.idx' file already exists (resulting in Windows being unable to write over the file). Did you want to queue these two topics separately? It looks like you picked them up in 'seen' as one big topic in a8f367dae2 (Merge branch 'tb/pack-revindex-api' into jch, 2021-01-12), but I thought you may want to pick parts one and two up separately to allow them to progress independently of one another. In either case, I am planning on sending a new version of the first series tomorrow to address Peff's feedback. > Thanks. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes:
> Did you want to queue these two topics separately?
Unless it is a reasonable amount of certainty that the bottom topic
will not have to be rerolled, I'd rather not to have separate topics
on top of each other.
It is tedious and error prone, having to rebase the old iteration of
the top topic on top when a new iteration of the bottom topic comes
out. I'd rather see that the top one get rerolled whenever the
bottom one gets rerolled to make life simpler.
Even in a single topic, I would encourage people to put the
foundational and preparatory work early so that we can make them
solidify before review really gets to later parts of the series.
And when a series is structured like so, it is perfectly fine to say
something like:
Here is a new iteration of the last 7 patches---the early 13
patches are the same as the previous round, so reset the topic
to bb6414ab (packfile: prepare for the existence of '*.rev'
files, 2021-01-08) and then apply these 7.
if you do not want to send all patches in a nontrivial series when
it gets updated.
Thanks.
On Tue, Jan 12, 2021 at 06:15:11PM -0800, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > Did you want to queue these two topics separately? > > Unless it is a reasonable amount of certainty that the bottom topic > will not have to be rerolled, I'd rather not to have separate topics > on top of each other. Hmm. I certainly do not want to create any tedious work for you, but I do think that we're in a situation where the bottom topic is stable and the interesting discussion will happen in the newer topic. > It is tedious and error prone, having to rebase the old iteration of > the top topic on top when a new iteration of the bottom topic comes > out. I'd rather see that the top one get rerolled whenever the > bottom one gets rerolled to make life simpler. Indeed, my local v2 of the bottom topic requires the top topic to be rebased onto it, and so I'll plan to send a v2 of each tomorrow morning. > Even in a single topic, I would encourage people to put thet > foundational and preparatory work early so that we can make them > solidify before review really gets to later parts of the series. > > And when a series is structured like so, it is perfectly fine to say > something like: > > Here is a new iteration of the last 7 patches---the early 13 > patches are the same as the previous round, so reset the topic > to bb6414ab (packfile: prepare for the existence of '*.rev' > files, 2021-01-08) and then apply these 7. > > if you do not want to send all patches in a nontrivial series when > it gets updated. I am happy to do it this way if you would prefer. If so, you may want to rename this topic while queuing, since it is not just about a new revindex API, but rather about introducing the on-disk format. (I would suggest tb/on-disk-revindex, if you were looking for alternate names). If you agree that the bottom topic is stable, I'd prefer to send the top one separately. Otherwise, I can send both together. Let me know. > Thanks. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > If you agree that the bottom topic is stable, I'd prefer to send the top > one separately. Otherwise, I can send both together. Let me know. I do not expect the first 20 of the 20+8 patches to be stable from the beginning---in fact, after reading 01/20 myself, and seeing a few of Peff's reviews, I expect that you'll be redoing at least some of them. When we deal with this kind of situation (not limited to these topics), let's make it a tradition to first pretend that we have a long single topic, and expect that the early rounds of review focus on two things: (1) to identify the best ordering of the commits (topics from experienced contributors like you are likely to be already structured in a good order, so reviewers may only have to say "the ordering looks good", but sometimes they may want to say thinks like "it would be better to leave patches 5, 8 and 11 to much later in the series". (2) to find good points to divide the series into two (or more) pieces, and spend more effort on helping the bottom part to solidify faster. That way, the bottom part can be merged sooner to 'next' than the rest. It always is cumbersome to have some part of the series in 'next' and remainder in 'seen', so at that point, the lower half would naturally gain a different name before it gets merged to 'next', I would think. Thanks.
On Wed, Jan 13, 2021 at 12:21:12AM -0800, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > If you agree that the bottom topic is stable, I'd prefer to send the top > > one separately. Otherwise, I can send both together. Let me know. > > I do not expect the first 20 of the 20+8 patches to be stable from > the beginning---in fact, after reading 01/20 myself, and seeing a > few of Peff's reviews, I expect that you'll be redoing at least some > of them. They'll definitely need at least one re-roll. But I think Taylor is expecting (and I do too) that the second half will probably have a lot more back-and-forth over the on-disk format, and hence need more re-rolls. My main concern is reviewer fatigue. 28 patches is a lot. If we can solidify the first 20 and then let people focus on the final 8 separately, that helps. If you're OK with splitting a topic and saying "this is a re-roll of just the last 8 patches", then that problem is solved. But IMHO it is easier to just point out that split from the start than it is to come up with it after the fact. It tells reviewers what to expect from the get-go. > (2) to find good points to divide the series into two (or more) > pieces, and spend more effort on helping the bottom part to > solidify faster. I think we just did that preemptively. ;) In these two particular series, the first 20 (or at least the first 19) are an improvement by themselves. I think they would be worth doing even without the final 8, both to make calling code more readable and to add new error checks and assertions to revindex access. > That way, the bottom part can be merged sooner to 'next' than the > rest. It always is cumbersome to have some part of the series in > 'next' and remainder in 'seen', so at that point, the lower half > would naturally gain a different name before it gets merged to > 'next', I would think. That seems to me like it ends up being _more_ work than just making them into two branches in the first place. So I guess I remain skeptical that ad-hoc splitting of longer series is easier than doing so up front. But you're the one who does all of the branch shuffling in the end, so if you really prefer longer series, I'm not what I think matters that much. ;) -Peff
On Wed, Jan 13, 2021 at 08:13:57AM -0500, Jeff King wrote: > On Wed, Jan 13, 2021 at 12:21:12AM -0800, Junio C Hamano wrote: > > > Taylor Blau <me@ttaylorr.com> writes: > > > > > If you agree that the bottom topic is stable, I'd prefer to send the top > > > one separately. Otherwise, I can send both together. Let me know. > > > > I do not expect the first 20 of the 20+8 patches to be stable from > > the beginning---in fact, after reading 01/20 myself, and seeing a > > few of Peff's reviews, I expect that you'll be redoing at least some > > of them. > > They'll definitely need at least one re-roll. But I think Taylor is > expecting (and I do too) that the second half will probably have a lot > more back-and-forth over the on-disk format, and hence need more > re-rolls. Indeed, I find the first ~20 patches fairly benign, and I think that the interesting discussion will and should take place over the final 8 patches. For what it's worth, I was referring to the pending re-roll I have of the first 20 patches as the stable one. (Peff notes in [1] that he thought there wasn't much else to consider beyond his comments.) > My main concern is reviewer fatigue. 28 patches is a lot. If we can > solidify the first 20 and then let people focus on the final 8 > separately, that helps. If you're OK with splitting a topic and saying > "this is a re-roll of just the last 8 patches", then that problem is > solved. But IMHO it is easier to just point out that split from the > start than it is to come up with it after the fact. It tells reviewers > what to expect from the get-go. Yes, exactly. > > That way, the bottom part can be merged sooner to 'next' than the > > rest. It always is cumbersome to have some part of the series in > > 'next' and remainder in 'seen', so at that point, the lower half > > would naturally gain a different name before it gets merged to > > 'next', I would think. > > That seems to me like it ends up being _more_ work than just making them > into two branches in the first place. I agree, but I also wasn't aware that you would consider queuing part of a series. If that's the route you want to take, I'm OK with that. But I tend to agree with Peff that (in this case since a clear deliniation already exists) it may save us time to just send two separate series from the get-go. > So I guess I remain skeptical that ad-hoc splitting of longer series is > easier than doing so up front. But you're the one who does all of the > branch shuffling in the end, so if you really prefer longer series, I'm > not what I think matters that much. ;) Agreed. Junio, let me know which you'd prefer in this case (I'm not sure if the additional context has changed your mind or not). > -Peff Thanks, Taylor [1]: https://lore.kernel.org/git/X%2F1vy3D10wDEZNva@coredump.intra.peff.net/
Taylor Blau <me@ttaylorr.com> writes: >> > That way, the bottom part can be merged sooner to 'next' than the >> > rest. It always is cumbersome to have some part of the series in >> > 'next' and remainder in 'seen', so at that point, the lower half >> > would naturally gain a different name before it gets merged to >> > 'next', I would think. >> >> That seems to me like it ends up being _more_ work than just making them >> into two branches in the first place. More work to contributors? How? As long as each of 20-patch and 8-patch series is marked clearly to manage the risk of mistakes and confusion down to the same level as a single long series, I am perfectly OK. Examples of help contributors could have made, which would have avoided past confusion (these are not "potential" ones, but I had to redo day's intergration in the past because of one long topic building on top of another) are: - When sending either topic, not limited to the initial round but in all the subsequent rounds, remind that the top topic is to be applied on top of the bottom topic. - When updating the bottom topic (e.g. 20-patch one in this case), send out the top one (e.g. 8-patch one), too (or instruct me to discard the top one tentatively). The worst case that happened in the past was that a quite minor tweak was made to a bottom topic that was depended on another topic, so I just queued the new iteration of the bottom topic again, without realizing that the other one needed to be rebased. We ended up two copies of the bottom topic commits in 'pu' (these days we call it 'seen') as the tweak was so minor that the two topics cleanly merged into 'pu' without causing conflict. The next bad case was a similar situation with larger rewrite of the bottom topic, which caused me to look at quite a big conflict and waste my time until I realized that I was missing an updated top half. If the inter-dependent topics that caused me trouble were managed as a single long patch series, either with "this is a full replacement of the new iteration" or "these are to update only the last 8 patches; apply them after rewinding the topic to commit f0e1d2c3 (gostak: distim doshes, 2021-01-08)", would have had a lot less risk to introduce human error on this end. > I agree, but I also wasn't aware that you would consider queuing part of > a series. If that's the route you want to take, I'm OK with that. Discarding broken part of a series and only queuing a good part can happen with or without multiple topics. Merging one topic to 'next' but not the other also happens. Merging early part of a topic to 'next' while leaving the rest to 'seen' is possible but I'd prefer to avoid it. Because of the last one, a single long topic, when a bottom part stabilizes enough, would likely to gain a separate name and its tip would be merged to 'next'. > But I > tend to agree with Peff that (in this case since a clear deliniation > already exists) it may save us time to just send two separate series > from the get-go. As long as the two serieses are marked as such clearly, not just in the initial round but in all subsequent rounds, it is OK. But in an unproven initial round, you may regret having to move a patch across topics, from the bottom one to the top one or vice versa, instead of just reordering inside a single topic. >> So I guess I remain skeptical that ad-hoc splitting of longer series is >> easier than doing so up front. Nobody suggested ad-hoc splitting. I was saying that splitting would naturally grow out of reviews toward stabilization.
On Wed, Jan 13, 2021 at 12:06:08PM -0800, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > But I tend to agree with Peff that (in this case since a clear > > deliniation already exists) it may save us time to just send two > > separate series from the get-go. > > As long as the two serieses are marked as such clearly, not just in > the initial round but in all subsequent rounds, it is OK. But in an > unproven initial round, you may regret having to move a patch across > topics, from the bottom one to the top one or vice versa, instead of > just reordering inside a single topic. Sounds good. Like I said, the last thing I want to do is create undue burden on your or anybody else when queuing or reviewing these topics. So, I'll try to err on the side of stating the dependency between topics too often rather than not often enough. Hopefully the first ~20 patches are boring enough that they will make their way to master soon enough and we don't have to worry about it. Ordinarily, I would have held off on the second series until more the first one had graduated, but I felt that the pure refactorings didn't make much sense on their own without the new file-based backend to motivate them. Thanks, Taylor
On Wed, Jan 13, 2021 at 12:06:08PM -0800, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > >> > That way, the bottom part can be merged sooner to 'next' than the > >> > rest. It always is cumbersome to have some part of the series in > >> > 'next' and remainder in 'seen', so at that point, the lower half > >> > would naturally gain a different name before it gets merged to > >> > 'next', I would think. > >> > >> That seems to me like it ends up being _more_ work than just making them > >> into two branches in the first place. > > More work to contributors? How? The quoted part is from me, so I'll respond: I didn't mean contributors, but it seems like more work to you. I.e., you are ending up with the same multi-branch config _and_ you have to split the branches yourself later after seeing review. But reading what you wrote below, the advantage is that if this does not happen until the first part hits "next", then there is no chance of it being rebased at that point (and thus getting rewritten out from under the second topic). > The worst case that happened in the past was that a quite minor > tweak was made to a bottom topic that was depended on another topic, > so I just queued the new iteration of the bottom topic again, > without realizing that the other one needed to be rebased. We ended > up two copies of the bottom topic commits in 'pu' (these days we > call it 'seen') as the tweak was so minor that the two topics > cleanly merged into 'pu' without causing conflict. The next bad > case was a similar situation with larger rewrite of the bottom > topic, which caused me to look at quite a big conflict and waste my > time until I realized that I was missing an updated top half. I somehow assumed you had more automation there. On my end, since I rebase my topics aggressively, it is just a matter of pointing the branch upstream in the right place. But of course that is not your workflow at all. I know you do have the "this branches uses" logic in your what's cooking emails. In theory it could remind you of the situation, but I'm not sure where in the workflow you'd insert it (by the time you run the WC script, it is hard to realize the rebasing that _should_ have been done earlier, unless you collate patch-ids, and even that is not 100%). I do wonder if setting the dependent branch's @{upstream} would be helpful here. You do not rebase all of your topics, but the ones with a local-branch @{u} would be candidates for doing so. All that said, I am also sensitive that my armchair "you could do it like this..." suggesting may not be fully informed. So take it as idle thoughts, not necessarily arguments. :) > >> So I guess I remain skeptical that ad-hoc splitting of longer series is > >> easier than doing so up front. > > Nobody suggested ad-hoc splitting. I was saying that splitting > would naturally grow out of reviews toward stabilization. This quote is me again. By "ad-hoc" I meant exactly this "after we see some reviews" (as opposed to drawing a line up front). -Peff