[4/5] strmap: add strdup_strings option

Message ID	b3095d97d8ee9d6576292731cc100492e7c64f13.1598035949.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=jbtA=B7=vger.kernel.org=git-owner@kernel.org> Message-Id: <b3095d97d8ee9d6576292731cc100492e7c64f13.1598035949.git.gitgitgadget@gmail.com> In-Reply-To: <pull.835.git.git.1598035949.gitgitgadget@gmail.com> References: <pull.835.git.git.1598035949.gitgitgadget@gmail.com> From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> Date: Fri, 21 Aug 2020 18:52:28 +0000 Subject: [PATCH 4/5] strmap: add strdup_strings option Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Jeff King <peff@peff.net>, Elijah Newren <newren@gmail.com>, Elijah Newren <newren@gmail.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Add struct strmap and associated utility functions \| expand [0/5] Add struct strmap and associated utility functions [1/5] hashmap: add usage documentation explaining hashmap_free[_entries]() [2/5] strmap: new utility functions [3/5] strmap: add more utility functions [4/5] strmap: add strdup_strings option [5/5] strmap: add functions facilitating use as a string->int map [v2,06/10] strmap: add more utility functions [v2,07/10] strmap: enable faster clearing and reusing of strmaps [v2,08/10] strmap: add functions facilitating use as a string->int map [v2,09/10] strmap: add a strset sub-type [v2,10/10] strmap: enable allocations to come from a mem_pool

Linus Arver via GitGitGadget Aug. 21, 2020, 6:52 p.m. UTC

From: Elijah Newren <newren@gmail.com>

Just as it is sometimes useful for string_list to duplicate and take
ownership of memory management of the strings it contains, the same is
sometimes true for strmaps as well.  Add the same flag from string_list
to strmap.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 strmap.c | 23 ++++++++++++++++-------
 strmap.h |  9 +++++----
 2 files changed, 21 insertions(+), 11 deletions(-)

Jeff King Aug. 21, 2020, 8:01 p.m. UTC | #1

On Fri, Aug 21, 2020 at 06:52:28PM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
> 
> Just as it is sometimes useful for string_list to duplicate and take
> ownership of memory management of the strings it contains, the same is
> sometimes true for strmaps as well.  Add the same flag from string_list
> to strmap.

This is actually one of the ugliest parts of string_list, IMHO, and I'd
prefer if we can avoid duplicating it. Yes, sometimes we can manage to
avoid an extra copy of a string. But the resulting ownership and
lifetime questions are often very error-prone. In other data structures
we've moved towards just having the structure own its data (e.g.,
strvec does so, and things like oidmap store their own oids). I've been
happy with the simplicity of it.

It also works if you use a flex-array for the key storage in the
strmap_entry. :)

-Peff

Elijah Newren Aug. 21, 2020, 8:41 p.m. UTC | #2

On Fri, Aug 21, 2020 at 1:01 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, Aug 21, 2020 at 06:52:28PM +0000, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > Just as it is sometimes useful for string_list to duplicate and take
> > ownership of memory management of the strings it contains, the same is
> > sometimes true for strmaps as well.  Add the same flag from string_list
> > to strmap.
>
> This is actually one of the ugliest parts of string_list, IMHO, and I'd
> prefer if we can avoid duplicating it. Yes, sometimes we can manage to
> avoid an extra copy of a string. But the resulting ownership and
> lifetime questions are often very error-prone. In other data structures
> we've moved towards just having the structure own its data (e.g.,
> strvec does so, and things like oidmap store their own oids). I've been
> happy with the simplicity of it.
>
> It also works if you use a flex-array for the key storage in the
> strmap_entry. :)

I can see how it's easier, but that worries me about the number of
extra copies for my usecase.  In order to minimize actual computation,
I track an awful lot of auxiliary data in merge-ort so that I know
when I can safely perform many different case-specific optimizations.
Among other things, this means 15 strmaps.  1 of those stores a
mapping from all paths that traverse_trees() walks over (file or
directory) to metadata about the content on the three different sides.
9 of the remaining 14 simply share the strings in the main strmap,
because I don't need extra copies of the paths in the repository.  I
could (and maybe should) extend that to 11 of the 14.  Only 3 actually
do need to store a copy of the paths (because they store data used
beyond the end of an inner recursive merge or can be used to
accelerate subsequent commits in a rebase or cherry-pick sequence).

So, in most my cases, I don't want to duplicate strings.  I actually
started my implementation using FLEX_ALLOC_STR(), as you suggested
earlier in this thread, but tossed it because of this same desire to
not duplicate strings but just share them between the strmaps.

Granted, I made that decision before I had a complete implementation,
so I didn't measure the actual costs.  It's possible that was a
premature optimization.

Jeff King Aug. 21, 2020, 9:03 p.m. UTC | #3

On Fri, Aug 21, 2020 at 01:41:44PM -0700, Elijah Newren wrote:

> > This is actually one of the ugliest parts of string_list, IMHO, and I'd
> > prefer if we can avoid duplicating it. Yes, sometimes we can manage to
> > avoid an extra copy of a string. But the resulting ownership and
> > lifetime questions are often very error-prone. In other data structures
> > we've moved towards just having the structure own its data (e.g.,
> > strvec does so, and things like oidmap store their own oids). I've been
> > happy with the simplicity of it.
> >
> > It also works if you use a flex-array for the key storage in the
> > strmap_entry. :)
> 
> I can see how it's easier, but that worries me about the number of
> extra copies for my usecase.  In order to minimize actual computation,
> I track an awful lot of auxiliary data in merge-ort so that I know
> when I can safely perform many different case-specific optimizations.
> Among other things, this means 15 strmaps.  1 of those stores a
> mapping from all paths that traverse_trees() walks over (file or
> directory) to metadata about the content on the three different sides.
> 9 of the remaining 14 simply share the strings in the main strmap,
> because I don't need extra copies of the paths in the repository.  I
> could (and maybe should) extend that to 11 of the 14.  Only 3 actually
> do need to store a copy of the paths (because they store data used
> beyond the end of an inner recursive merge or can be used to
> accelerate subsequent commits in a rebase or cherry-pick sequence).

I'd have to see the code, of course, but:

  - keep in mind you're allocating 8 bytes for a pointer (plus 24 for
    the rest of the strmap entry). If you use a flex-array you get those
    8 bytes back. Full paths do tend to be longer than that, so it's
    probably net worse than a pointer to an existing string. But how
    much worse, and does it matter?

  - That sounds like a lot of maps. :) I guess you've looked at
    compacting some of them into a single map-to-struct?

> So, in most my cases, I don't want to duplicate strings.  I actually
> started my implementation using FLEX_ALLOC_STR(), as you suggested
> earlier in this thread, but tossed it because of this same desire to
> not duplicate strings but just share them between the strmaps.
> 
> Granted, I made that decision before I had a complete implementation,
> so I didn't measure the actual costs.  It's possible that was a
> premature optimization.

I'm just really concerned that it poisons the data structure with
complexity that many of the other callers will have to deal with. We've
had several "oops, strdup_strings wasn't what I expected it to be" bugs
with string-list (in both directions: leaks and use-after-free). It
would be nice to have actual numbers and see if it's worth the cost.

-Peff

Elijah Newren Aug. 21, 2020, 10:25 p.m. UTC | #4

On Fri, Aug 21, 2020 at 2:03 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, Aug 21, 2020 at 01:41:44PM -0700, Elijah Newren wrote:
>
> > > This is actually one of the ugliest parts of string_list, IMHO, and I'd
> > > prefer if we can avoid duplicating it. Yes, sometimes we can manage to
> > > avoid an extra copy of a string. But the resulting ownership and
> > > lifetime questions are often very error-prone. In other data structures
> > > we've moved towards just having the structure own its data (e.g.,
> > > strvec does so, and things like oidmap store their own oids). I've been
> > > happy with the simplicity of it.
> > >
> > > It also works if you use a flex-array for the key storage in the
> > > strmap_entry. :)
> >
> > I can see how it's easier, but that worries me about the number of
> > extra copies for my usecase.  In order to minimize actual computation,
> > I track an awful lot of auxiliary data in merge-ort so that I know
> > when I can safely perform many different case-specific optimizations.
> > Among other things, this means 15 strmaps.  1 of those stores a
> > mapping from all paths that traverse_trees() walks over (file or
> > directory) to metadata about the content on the three different sides.
> > 9 of the remaining 14 simply share the strings in the main strmap,
> > because I don't need extra copies of the paths in the repository.  I
> > could (and maybe should) extend that to 11 of the 14.  Only 3 actually
> > do need to store a copy of the paths (because they store data used
> > beyond the end of an inner recursive merge or can be used to
> > accelerate subsequent commits in a rebase or cherry-pick sequence).
>
> I'd have to see the code, of course, but:

>   - keep in mind you're allocating 8 bytes for a pointer (plus 24 for
>     the rest of the strmap entry). If you use a flex-array you get those
>     8 bytes back. Full paths do tend to be longer than that, so it's
>     probably net worse than a pointer to an existing string. But how
>     much worse, and does it matter?

I'll investigate; it may take a while...

>   - That sounds like a lot of maps. :) I guess you've looked at
>     compacting some of them into a single map-to-struct?

Oh, map-to-struct is the primary use.  But compacting them won't work,
because the reason for the additional maps is that they have different
sets of keys (this set of paths meet a certain condition...).  Only
one map contains all the paths involved in the merge.

Also, several of those maps don't even store a value; and are really
just a set implemented via strmap (thus meaning the only bit of data I
need for some conditions is whether any given path meets it).  It
seems slightly ugly to have to call strmap_put(map, string, NULL) for
those.  I wonder if I should have another strset type much like your
suggesting for strintmap.  Hmm...

Also, one thing that inflates the number of strmaps I use is that
several of those conditions are specific to a certain side of the
merge, thus requiring two strmaps for each of those special
conditions.

> > So, in most my cases, I don't want to duplicate strings.  I actually
> > started my implementation using FLEX_ALLOC_STR(), as you suggested
> > earlier in this thread, but tossed it because of this same desire to
> > not duplicate strings but just share them between the strmaps.
> >
> > Granted, I made that decision before I had a complete implementation,
> > so I didn't measure the actual costs.  It's possible that was a
> > premature optimization.
>
> I'm just really concerned that it poisons the data structure with
> complexity that many of the other callers will have to deal with. We've
> had several "oops, strdup_strings wasn't what I expected it to be" bugs
> with string-list (in both directions: leaks and use-after-free). It
> would be nice to have actual numbers and see if it's worth the cost.

I'll go get some and find out what the impact is.

Jeff King Aug. 28, 2020, 7:08 a.m. UTC | #5

On Fri, Aug 21, 2020 at 03:25:44PM -0700, Elijah Newren wrote:

> >   - That sounds like a lot of maps. :) I guess you've looked at
> >     compacting some of them into a single map-to-struct?
> 
> Oh, map-to-struct is the primary use.  But compacting them won't work,
> because the reason for the additional maps is that they have different
> sets of keys (this set of paths meet a certain condition...).  Only
> one map contains all the paths involved in the merge.

OK, I guess I'm not surprised that you would not have missed such an
obvious optimization. :)

> Also, several of those maps don't even store a value; and are really
> just a set implemented via strmap (thus meaning the only bit of data I
> need for some conditions is whether any given path meets it).  It
> seems slightly ugly to have to call strmap_put(map, string, NULL) for
> those.  I wonder if I should have another strset type much like your
> suggesting for strintmap.  Hmm...

FWIW, khash does have a "set" mode where it avoids allocating the value
array at all.

What's the easiest way to benchmark merge-ort? I suspect I could swap
out hashmap for khash (messily) in an hour or less.

-Peff

Elijah Newren Aug. 28, 2020, 5:20 p.m. UTC | #6

On Fri, Aug 28, 2020 at 12:08 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Aug 21, 2020 at 03:25:44PM -0700, Elijah Newren wrote:
>
> > >   - That sounds like a lot of maps. :) I guess you've looked at
> > >     compacting some of them into a single map-to-struct?
> >
> > Oh, map-to-struct is the primary use.  But compacting them won't work,
> > because the reason for the additional maps is that they have different
> > sets of keys (this set of paths meet a certain condition...).  Only
> > one map contains all the paths involved in the merge.
>
> OK, I guess I'm not surprised that you would not have missed such an
> obvious optimization. :)
>
> > Also, several of those maps don't even store a value; and are really
> > just a set implemented via strmap (thus meaning the only bit of data I
> > need for some conditions is whether any given path meets it).  It
> > seems slightly ugly to have to call strmap_put(map, string, NULL) for
> > those.  I wonder if I should have another strset type much like your
> > suggesting for strintmap.  Hmm...
>
> FWIW, khash does have a "set" mode where it avoids allocating the value
> array at all.

Cool.

> What's the easiest way to benchmark merge-ort?

Note that I discovered another optimization that I'm working on
implementing; when finished, it should cut down a little more on the
time spent on inexact rename detection.  That should have the side
effect of having the time spent on strmaps stick out some more in the
overall timings (as a percentage of overall time anyway).  So, I'm
focused on that before I do other benchmarking work (which is part of
the reason I mentioned my strmap/hashmap benchmarking last week might
take a while).

Anyway, on to your question:

=== If you just want to be able to run the ort merge algorithm ===

Clone git@github.com:newren/git and checkout the 'ort' branch and
build it.  It currently changes the default merge algorithm to 'ort'
and even ignores '-s recursive' by remapping it to '-s ort' (because I
wanted to see how regression tests fared with ort as a replacement for
recrusive).  It should pass the regression tests if you want to run
those first.  But note that if you want to compare 'ort' to
'recursive', then currently you need to have two different git builds,
one of my branch and one with a different checkout of something else
(e.g. 2.28.0 or 'master' or whatever).

=== Decide the granularity of your timing ===

I suspect you know more than me here, but maybe my pointers are useful anyway...

Decide if you want to measure overall program runtime, or dive into
details.  I used both a simple 'time' and the better 'hyperfine' for
the former, and used both 'perf' and GIT_TRACE2_PERF for the latter.
One nice thing about GIT_TRACE2_PERF was I wrote a simple program to
aggregate the times per region and provide percentages, in a script at
the toplevel named 'summarize-perf' that I can use to prefix commands.
Thus, I could for example run from my linux clone:
    $ ../git/summarize-perf git fast-rebase --onto HEAD base hwmon-updates
and I'd get output that looks something like (note that this is a
subset of the real output):
    1.400 : 35 : label:inmemory_nonrecursive
       0.827 : 41 : ..label:renames
          0.019 : <unmeasured> ( 2.2%)
          0.803 : 37 : ....label:regular renames
          0.004 : 31 : ....label:directory renames
          0.001 : 31 : ....label:process renames
       0.513 : 41 : ..label:collect_merge_info
       0.048 : 35 : ..label:process_entries
    0.117 : 1 : label:checkout
    0.000 : 1 : label:record_unmerged
and where those fields are <time> : <count> : <region label>.

=== If you want to time the testcases I used heavily while developing ===

The rebase-testcase/redo-timings script (in the ort branch) has
details on what I actually ran, though it has some paranoia around
attempting to make my laptop run semi-quietly and try to avoid all the
variance that I wished I could control a bit better.  And it assumes
you are running in a linux clone with a few branches set up a certain
way.  Let me explain those tests without using that script, as simply
as I can:

The setup for the particular cases I was testing is as follows:
  * Clone the linux kernel, and run the following:
  $ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
  $ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
  $ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
  $ git switch -c 5.4-renames v5.4
  $ git mv drivers pilots
  $ git commit -m "Rename drivers/ to pilots/"

And from there, there were three primary tests I was comparing:

  * Rename testcase, 35 patches:
  $ git checkout 5.4-renames^0
  $ git fast-rebase --onto HEAD base hwmon-updates

  * Rename testcase, just 1 patch:
  $ git switch 5.4-renames^0
  $ git fast-rebase --onto HEAD base hwmon-just-one

  * No renames (or at least very few renames) testcase, 35 patches:
  $ git checkout v5.4^0
  $ git branch -f hwmon-updates
fd8bdb23b91876ac1e624337bb88dc1dcc21d67e # Need to reset
hwmon-updates, due to fast-rebase done above
  $ git fast-rebase --onto HEAD base hwmon-updates

(If you want to compare with 'recursive' from a different build of
git, just replace 'fast-rebase' with 'rebase'.  You can also use
'rebase' instead of 'fast-rebase' on the ort branch and it'll use the
ort merge algorithm, but you get all the annoying
working-tree-updates-while-rebasing rather than just having the
working tree updated at the end of the rebase.  You also get all the
annoying forks of 'git checkout' and 'git commit' that sequencer is
guilty of spawning.  But it certainly supports a lot more options and
can save state to allow resuming after conflicts, unlike
'fast-rebase'.)

> I suspect I could swap out hashmap for khash (messily) in an hour or less.

Well, you might be assuming I used sane strmaps, with each strmap
having a fixed type for the stored value.  That's mostly true, but
there were two counterexamples I can think of: "paths" (the biggest
strmap) is a map of string -> {merged_info OR conflict_info}, because
merged_info is a smaller subset of conflict_info and saves space for
each path that can be trivially merged.  Also, in diffcore-rename,
"dir_rename" starts life as a map of string -> strmap, but later
transitions to string -> string, because I'm evil and didn't set up a
temporary strmap like I probably should have.

Also, the code is littered with FIXME comments, unnecessary #ifdefs,
and is generally in need of lots of cleanup.  Sorry.

[4/5] strmap: add strdup_strings option

Commit Message

Comments

Patch