[v3,0/9] Rewrite packfile reuse code
mbox series

Message ID 20191115141541.11149-1-chriscool@tuxfamily.org
Headers show
Series
  • Rewrite packfile reuse code
Related show

Message

Christian Couder Nov. 15, 2019, 2:15 p.m. UTC
This patch series is rewriting the code that tries to reuse existing
packfiles.

The code in this patch series was written by GitHub, and Peff nicely
provided it in the following discussion:

https://public-inbox.org/git/3E56B0FD-EBE8-4057-A93A-16EBB09FBCE0@jramsay.com.au/

The first versions of this patch series were also discussed:

V2: https://public-inbox.org/git/20191019103531.23274-1-chriscool@tuxfamily.org/
V1: https://public-inbox.org/git/20190913130226.7449-1-chriscool@tuxfamily.org/

Thanks to the reviewers!

According to Peff this new code is a lot smarter than what it
replaces. It allows "holes" in the chunks of packfile to be reused,
and skips over them. It rewrites OFS_DELTA offsets as it goes to
account for the holes. So it's basically a linear walk over the
packfile, but with the important distinction that we don't add those
objects to the object_entry array, which makes them very lightweight
(especially in memory use, but they also aren't considered bases for
finding new deltas, etc). It seems like a good compromise between the
cost to serve a clone and the quality of the resulting packfile.

Changes since the previous patch series are the following:

  - Rebased onto current master (d9f6f3b619, The first batch post 2.24
    cycle, 2019-11-10)

  - Remove a paragraph in the commit message of patch 3/9 as suggested
    by Jonathan Tan.

  - Improve commit message of patch 9/9 as suggested by Jonathan Tan.

  - Renamed fields of struct reused_chunk in patch 9/9 as suggested by
    Jonathan Tan.

  - Added a few comments in patch 9/9 as suggested by Jonathan Tan.

It could be a good idea if Peff could answer some of the comments made
by Jonathan Tan about patch 9/9.

I have put Peff as the author of all the commits.

Jeff King (9):
  builtin/pack-objects: report reused packfile objects
  packfile: expose get_delta_base()
  ewah/bitmap: introduce bitmap_word_alloc()
  pack-bitmap: don't rely on bitmap_git->reuse_objects
  pack-bitmap: introduce bitmap_walk_contains()
  csum-file: introduce hashfile_total()
  pack-objects: introduce pack.allowPackReuse
  builtin/pack-objects: introduce obj_is_packed()
  pack-objects: improve partial packfile reuse

 Documentation/config/pack.txt |   4 +
 builtin/pack-objects.c        | 243 +++++++++++++++++++++++++++-------
 csum-file.h                   |   9 ++
 ewah/bitmap.c                 |  13 +-
 ewah/ewok.h                   |   1 +
 pack-bitmap.c                 | 178 ++++++++++++++++++-------
 pack-bitmap.h                 |   6 +-
 packfile.c                    |  10 +-
 packfile.h                    |   3 +
 9 files changed, 357 insertions(+), 110 deletions(-)

Comments

Jonathan Tan Nov. 15, 2019, 6:03 p.m. UTC | #1
> It could be a good idea if Peff could answer some of the comments made
> by Jonathan Tan about patch 9/9.
> 
> I have put Peff as the author of all the commits.

Thanks. I think the series looks mostly good except for the questions I
raised in patch 9/9, so I'll wait for Peff to respond too.
Junio C Hamano Nov. 25, 2019, 6:30 a.m. UTC | #2
Jonathan Tan <jonathantanmy@google.com> writes:

>> It could be a good idea if Peff could answer some of the comments made
>> by Jonathan Tan about patch 9/9.
>> 
>> I have put Peff as the author of all the commits.
>
> Thanks. I think the series looks mostly good except for the questions I
> raised in patch 9/9, so I'll wait for Peff to respond too.

Hmph, the round before this one has been in 'next' for quite a
while, so should I eject it before waiting for Peff to respond
before queuing this one?

Thanks.
Junio C Hamano Nov. 25, 2019, 6:36 a.m. UTC | #3
Junio C Hamano <gitster@pobox.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>
>>> It could be a good idea if Peff could answer some of the comments made
>>> by Jonathan Tan about patch 9/9.
>>> 
>>> I have put Peff as the author of all the commits.
>>
>> Thanks. I think the series looks mostly good except for the questions I
>> raised in patch 9/9, so I'll wait for Peff to respond too.
>
> Hmph, the round before this one has been in 'next' for quite a
> while, so should I eject it before waiting for Peff to respond
> before queuing this one?

After rebasing these v3 patches on top of the base of the one in
'next', the only difference seems to be the log message of 3/9 and
the contents of 9/9.  I guess I'll mark the topic as "on hold" for
now before doing anything, as I am officially taking a time-off most
of this week ;-)

Thanks.
Junio C Hamano Dec. 6, 2019, 9:42 p.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>>> It could be a good idea if Peff could answer some of the comments made
>>>> by Jonathan Tan about patch 9/9.
>>>> 
>>>> I have put Peff as the author of all the commits.
>>>
>>> Thanks. I think the series looks mostly good except for the questions I
>>> raised in patch 9/9, so I'll wait for Peff to respond too.
>>
>> Hmph, the round before this one has been in 'next' for quite a
>> while, so should I eject it before waiting for Peff to respond
>> before queuing this one?
>
> After rebasing these v3 patches on top of the base of the one in
> 'next', the only difference seems to be the log message of 3/9 and
> the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> now before doing anything, as I am officially taking a time-off most
> of this week ;-)

So..., that week has passed---anything new?

Thanks.
Christian Couder Dec. 7, 2019, 10:12 a.m. UTC | #5
On Fri, Dec 6, 2019 at 10:42 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> >> Jonathan Tan <jonathantanmy@google.com> writes:
> >>
> >>>> It could be a good idea if Peff could answer some of the comments made
> >>>> by Jonathan Tan about patch 9/9.
> >>>>
> >>>> I have put Peff as the author of all the commits.
> >>>
> >>> Thanks. I think the series looks mostly good except for the questions I
> >>> raised in patch 9/9, so I'll wait for Peff to respond too.
> >>
> >> Hmph, the round before this one has been in 'next' for quite a
> >> while, so should I eject it before waiting for Peff to respond
> >> before queuing this one?
> >
> > After rebasing these v3 patches on top of the base of the one in
> > 'next', the only difference seems to be the log message of 3/9 and
> > the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> > now before doing anything, as I am officially taking a time-off most
> > of this week ;-)
>
> So..., that week has passed---anything new?

Unfortunately, no.

If you want I can send an incremental change on the content of 9/9 on
top of what's in next. Otherwise I can't see what I could do on this.

Peff, could you tell us if you might have time to take a look at this soon?
Johannes Schindelin Dec. 7, 2019, 8:47 p.m. UTC | #6
Hi Chris,

On Sat, 7 Dec 2019, Christian Couder wrote:

> On Fri, Dec 6, 2019 at 10:42 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> > > Junio C Hamano <gitster@pobox.com> writes:
> > >
> > >> Jonathan Tan <jonathantanmy@google.com> writes:
> > >>
> > >>>> It could be a good idea if Peff could answer some of the comments made
> > >>>> by Jonathan Tan about patch 9/9.
> > >>>>
> > >>>> I have put Peff as the author of all the commits.
> > >>>
> > >>> Thanks. I think the series looks mostly good except for the questions I
> > >>> raised in patch 9/9, so I'll wait for Peff to respond too.
> > >>
> > >> Hmph, the round before this one has been in 'next' for quite a
> > >> while, so should I eject it before waiting for Peff to respond
> > >> before queuing this one?
> > >
> > > After rebasing these v3 patches on top of the base of the one in
> > > 'next', the only difference seems to be the log message of 3/9 and
> > > the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> > > now before doing anything, as I am officially taking a time-off most
> > > of this week ;-)
> >
> > So..., that week has passed---anything new?
>
> Unfortunately, no.
>
> If you want I can send an incremental change on the content of 9/9 on
> top of what's in next. Otherwise I can't see what I could do on this.
>
> Peff, could you tell us if you might have time to take a look at this soon?

Chris, correct me if I am wrong, but was it not your decision to
contribute these patches? Are you saying that you do not understand them
well enough to drive this patch series forward (e.g. address all reviews
and questions) and are basically trying to force Peff to contribute them
instead?

Ciao,
Johannes
Christian Couder Dec. 8, 2019, 7:53 a.m. UTC | #7
Hi Johannes,

On Sat, Dec 7, 2019 at 9:47 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> On Sat, 7 Dec 2019, Christian Couder wrote:
>
> > On Fri, Dec 6, 2019 at 10:42 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Junio C Hamano <gitster@pobox.com> writes:
> > >
> > > > Junio C Hamano <gitster@pobox.com> writes:
> > > >
> > > >> Jonathan Tan <jonathantanmy@google.com> writes:
> > > >>
> > > >>>> It could be a good idea if Peff could answer some of the comments made
> > > >>>> by Jonathan Tan about patch 9/9.
> > > >>>>
> > > >>>> I have put Peff as the author of all the commits.
> > > >>>
> > > >>> Thanks. I think the series looks mostly good except for the questions I
> > > >>> raised in patch 9/9, so I'll wait for Peff to respond too.
> > > >>
> > > >> Hmph, the round before this one has been in 'next' for quite a
> > > >> while, so should I eject it before waiting for Peff to respond
> > > >> before queuing this one?
> > > >
> > > > After rebasing these v3 patches on top of the base of the one in
> > > > 'next', the only difference seems to be the log message of 3/9 and
> > > > the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> > > > now before doing anything, as I am officially taking a time-off most
> > > > of this week ;-)
> > >
> > > So..., that week has passed---anything new?
> >
> > Unfortunately, no.
> >
> > If you want I can send an incremental change on the content of 9/9 on
> > top of what's in next. Otherwise I can't see what I could do on this.
> >
> > Peff, could you tell us if you might have time to take a look at this soon?
>
> Chris, correct me if I am wrong, but was it not your decision to
> contribute these patches?

Please take a look at:

https://public-inbox.org/git/3E56B0FD-EBE8-4057-A93A-16EBB09FBCE0@jramsay.com.au/

and Peff's response to James Ramsay's email.

Peff wrote:

> It's been on my todo list to upstream for a while, but I've dragged my
> feet on it because there's a lot of cleanup/polishing from the original
> patches (they were never very clean in the first place, and we've merged
> a dozen or more times with upstream since then, so the updates are
> spread across a bunch of merge commits).

and then:

> Yeah, I think we should work on getting our changes (including those
> stats) into upstream.

So actually I thought that I was helping Peff on this, though I know
of course that it's also helping GitLab and everyone else. That's why
I put Peff as the author of the patches.

> Are you saying that you do not understand them
> well enough to drive this patch series forward (e.g. address all reviews
> and questions) and are basically trying to force Peff to contribute them
> instead?

Yeah, I don't understand them well enough to answer Jonathan Tan's questions.

But no I am not trying to force Peff. I am trying to work with him.
When he said he thought we should work on getting the change into
upstream, I just thought he meant it and would be willing to help.





> Ciao,
> Johannes
Johannes Schindelin Dec. 8, 2019, 8:54 a.m. UTC | #8
Hi Chris,

On Sun, 8 Dec 2019, Christian Couder wrote:

> Peff wrote:
>
> > It's been on my todo list to upstream for a while, but I've dragged my
> > feet on it because there's a lot of cleanup/polishing from the original
> > patches (they were never very clean in the first place, and we've merged
> > a dozen or more times with upstream since then, so the updates are
> > spread across a bunch of merge commits).
>
> and then:
>
> > Yeah, I think we should work on getting our changes (including those
> > stats) into upstream.
>
> So actually I thought that I was helping Peff on this, though I know
> of course that it's also helping GitLab and everyone else.

In my experience, sending the initial set of patches is the easiest part
of contributing patches, by far. The most involved part of the process is
to react to reviewer comments and to prepare new iterations.

You can see this challenge in action in all the Git for Windows
patches/patch series I am "upstreaming".

So actually I think that you are doing a disservice to Peff: if he had
time for that tedious part of the patch contribution, I am sure it would
have been no burden at all to send the initial set of patches.

> That's why I put Peff as the author of the patches.

No, that is not the reason. You might think that that is the reason, but
the real reason why Peff is marked as the author of those patches is that
he really authored those patches.

In light of what you said, I don't think that it is a good idea to go
forward by leaning even further on Peff. From his activity on the Git
mailing list, I deduce that he is not exactly in need of even more work.

Instead, I think that if you truly want to push these patches forward, you
will have to dig deeper yourself, and answer Jonathan Tan's questions, and
possibly adjust the patches accordingly and send a new iteration.

I perceive it as very unfair toward Peff that this has not yet happened.

Ciao,
Johannes
Christian Couder Dec. 8, 2019, 10:26 a.m. UTC | #9
Hi Dscho,

On Sun, Dec 8, 2019 at 9:54 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> On Sun, 8 Dec 2019, Christian Couder wrote:
>
> > Peff wrote:
> >
> > > It's been on my todo list to upstream for a while, but I've dragged my
> > > feet on it because there's a lot of cleanup/polishing from the original
> > > patches (they were never very clean in the first place, and we've merged
> > > a dozen or more times with upstream since then, so the updates are
> > > spread across a bunch of merge commits).
> >
> > and then:
> >
> > > Yeah, I think we should work on getting our changes (including those
> > > stats) into upstream.
> >
> > So actually I thought that I was helping Peff on this, though I know
> > of course that it's also helping GitLab and everyone else.
>
> In my experience, sending the initial set of patches is the easiest part
> of contributing patches, by far. The most involved part of the process is
> to react to reviewer comments and to prepare new iterations.
>
> You can see this challenge in action in all the Git for Windows
> patches/patch series I am "upstreaming".
>
> So actually I think that you are doing a disservice to Peff: if he had
> time for that tedious part of the patch contribution, I am sure it would
> have been no burden at all to send the initial set of patches.

I think Peff can say by himself what he thinks about my work when I
rework the raw patches he sends to help get them upstreamed.

It's not the first time that I have done that and every time I have
done it, I think he has found it useful. Even this time he also wrote
that my work has been useful.

> > That's why I put Peff as the author of the patches.
>
> No, that is not the reason. You might think that that is the reason, but
> the real reason why Peff is marked as the author of those patches is that
> he really authored those patches.

That doesn't contradict at all what I am saying. I am saying that I
kept Peff as the author because I am just helping him, which means
that I actually acknowledge that he really authored those patches, no?

> In light of what you said, I don't think that it is a good idea to go
> forward by leaning even further on Peff. From his activity on the Git
> mailing list, I deduce that he is not exactly in need of even more work.

I think it's ok to ping people, even many times, when they have said
that they want to work on something but for some reason don't do it.
That's what Junio did by the way too. Junio just pinged everyone
involved, and then I pinged Peff specifically as I think the part of
the work left is more his than mine.

If Peff had said that he doesn't want to work on this any more then I
wouldn't ping him, and I would perhaps try something else, like just
ask for only the first patch (1/9) to be merged.

> Instead, I think that if you truly want to push these patches forward, you
> will have to dig deeper yourself, and answer Jonathan Tan's questions, and
> possibly adjust the patches accordingly and send a new iteration.

I think that it's ok to ping Peff until he says that he doesn't or
cannot for some reason work on this anymore. This shouldn't be a big
burden for him to say that, no?

> I perceive it as very unfair toward Peff that this has not yet happened.

I perceive it as unfair to me that you think that I have to do a lot
of work on this when Peff hasn't even said that he doesn't want to, or
cannot, answer Jonathan's question.

Best,
Christian.
Johannes Schindelin Dec. 8, 2019, 10:45 a.m. UTC | #10
Hi Chris,

On Sun, 8 Dec 2019, Christian Couder wrote:

> I perceive it as unfair to me that you think that I have to do a lot
> of work on this when Peff hasn't even said that he doesn't want to, or
> cannot, answer Jonathan's question.

Well, you have time enough to send lengthy replies on a Sunday morning
(while Peff apparently did not even have time to say that he lacks the
time to work on this).

So there,
Johannes
Jeff King Dec. 9, 2019, 6:18 a.m. UTC | #11
On Sun, Dec 08, 2019 at 09:54:01AM +0100, Johannes Schindelin wrote:

> > That's why I put Peff as the author of the patches.
> 
> No, that is not the reason. You might think that that is the reason, but
> the real reason why Peff is marked as the author of those patches is that
> he really authored those patches.
> 
> In light of what you said, I don't think that it is a good idea to go
> forward by leaning even further on Peff. From his activity on the Git
> mailing list, I deduce that he is not exactly in need of even more work.
> 
> Instead, I think that if you truly want to push these patches forward, you
> will have to dig deeper yourself, and answer Jonathan Tan's questions, and
> possibly adjust the patches accordingly and send a new iteration.
> 
> I perceive it as very unfair toward Peff that this has not yet happened.

To be clear, I am not bothered by this. And in fact I feel bad that I
promised Christian that I take a careful look at the patches again, but
haven't gotten around to it (for an embarrassingly long time now).

Now I would _love_ if somebody else dug into the topic enough to
understand all of the ins and outs, and whether what they're doing is
sane (or could be done better). But barring that, these patches have
been battle-tested for many years on GitHub's servers, so even if we
just take them as-is I hope it would be an improvement.

Fortunately I have some other work to do that I would like very much to
procrastinate on, so let me see if that can summon the willpower for me
to review these.

> Well, you have time enough to send lengthy replies on a Sunday morning
> (while Peff apparently did not even have time to say that he lacks the
> time to work on this).

One tricky thing here is that I leave messages or subthreads that I
intend to act on in my incoming Git mbox. And of course as time goes on,
those get pushed further back in the pile. But when new messages arrive,
mutt attaches them to the old threads, and I sometimes don't see them
(until I go back and sift through the pile).

I wish there was a good way to have mutt remain in threaded mode, but
sort the threads by recent activity. Setting sort_aux=last-date kind of
works, but last time I tried it, I got annoyed that it did funny things
with the order of patches within a thread (if somebody replies to patch
3/5, and then 2/5, it will pull 3/5 down as "more recent").

Dscho, you may feel free to roll your eyes and mutter under your breath
about email if you wish. ;)

-Peff
Johannes Schindelin Dec. 9, 2019, 9:28 a.m. UTC | #12
Hi Peff,

On Mon, 9 Dec 2019, Jeff King wrote:

> On Sun, Dec 08, 2019 at 09:54:01AM +0100, Johannes Schindelin wrote:
>
> > > That's why I put Peff as the author of the patches.
> >
> > No, that is not the reason. You might think that that is the reason, but
> > the real reason why Peff is marked as the author of those patches is that
> > he really authored those patches.
> >
> > In light of what you said, I don't think that it is a good idea to go
> > forward by leaning even further on Peff. From his activity on the Git
> > mailing list, I deduce that he is not exactly in need of even more work.
> >
> > Instead, I think that if you truly want to push these patches forward, you
> > will have to dig deeper yourself, and answer Jonathan Tan's questions, and
> > possibly adjust the patches accordingly and send a new iteration.
> >
> > I perceive it as very unfair toward Peff that this has not yet happened.
>
> To be clear, I am not bothered by this. And in fact I feel bad that I
> promised Christian that I take a careful look at the patches again, but
> haven't gotten around to it (for an embarrassingly long time now).
>
> Now I would _love_ if somebody else dug into the topic enough to
> understand all of the ins and outs, and whether what they're doing is
> sane (or could be done better).

That's what I thought.

When I bring patches to the Git mailing list, it means implicitly not only
that I understand the ins and outs of them, but also that I am fully
prepared to address reviewer comments and send out new, enhanced
iterations.

That holds when I send patch series that include patches authored by
someone else than me. I thought that that is kind of expected, otherwise
there would be no good reason for _me_ to send those patches, right?

> But barring that, these patches have been battle-tested for many years
> on GitHub's servers, so even if we just take them as-is I hope it would
> be an improvement.
>
> Fortunately I have some other work to do that I would like very much to
> procrastinate on, so let me see if that can summon the willpower for me
> to review these.

Heh, I know that feeling.

> > Well, you have time enough to send lengthy replies on a Sunday morning
> > (while Peff apparently did not even have time to say that he lacks the
> > time to work on this).
>
> One tricky thing here is that I leave messages or subthreads that I
> intend to act on in my incoming Git mbox. And of course as time goes on,
> those get pushed further back in the pile. But when new messages arrive,
> mutt attaches them to the old threads, and I sometimes don't see them
> (until I go back and sift through the pile).
>
> I wish there was a good way to have mutt remain in threaded mode, but
> sort the threads by recent activity. Setting sort_aux=last-date kind of
> works, but last time I tried it, I got annoyed that it did funny things
> with the order of patches within a thread (if somebody replies to patch
> 3/5, and then 2/5, it will pull 3/5 down as "more recent").

When I hit such a situation, I usually go on this kind of insane
side-track to figure out whether it would be easy to fix this (it's open
source, after all). Last time I tried such a thing, though, I had to admit
that it was not easy (but I use Alpine, not mutt, out of sheer inability
to adjust my muscle memory).

> Dscho, you may feel free to roll your eyes and mutter under your breath
> about email if you wish. ;)

Done.

;-)

Ciao,
Dscho
Junio C Hamano Dec. 9, 2019, 7 p.m. UTC | #13
Jeff King <peff@peff.net> writes:

> Now I would _love_ if somebody else dug into the topic enough to
> understand all of the ins and outs, and whether what they're doing is
> sane (or could be done better). But barring that, these patches have
> been battle-tested for many years on GitHub's servers, so even if we
> just take them as-is I hope it would be an improvement.

Sorry; it wasn't my intention to ask accelerating the topic
forward---it was just to see the current status to adjust my
expectations.  And you could have stopped at the end of this
paragraph but ...

> Fortunately I have some other work to do that I would like very much to
> procrastinate on, so let me see if that can summon the willpower for me
> to review these.

... I'd love to see us benefit from this ;-)
Junio C Hamano Dec. 9, 2019, 7:05 p.m. UTC | #14
Jeff King <peff@peff.net> writes:

> One tricky thing here is that I leave messages or subthreads that I
> intend to act on in my incoming Git mbox. And of course as time goes on,
> those get pushed further back in the pile. But when new messages arrive,
> mutt attaches them to the old threads, and I sometimes don't see them
> (until I go back and sift through the pile).

This is why I still let a tab in my browser to be squat by GMail
that shows only the traffic sent to this mailing list, as it is very
good at surfacing a thread with new activity even though it is bad
at everything else, including threading.

For real work, picking messages up and responding to them, I read
and handle the list traffic via NNTP interface to public-inbox (and
lore.k.o these days), but keeping GMail purely as a notification
channel has its uses ;-).