[0/5] tree-walk object_id refactor

Message ID: 20190110042551.915769-1-sandals@crustytoothpaste.net

Message

brian m. carlson Jan. 10, 2019, 4:25 a.m. UTC
There are a small number of places in our codebase where we cast a
buffer of unsigned char to a struct object_id pointer. When we have
GIT_MAX_RAWSZ set to 32 (because we have SHA-256), one of these places
(the buffer for tree objects) can lead to us copying too much data when
using SHA-1 as the hash, since there are only 20 bytes to read.
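
To illustrate the pattern, here is a simplified sketch (not an excerpt
from the actual tree-walk code; the function name is hypothetical):

  #define GIT_MAX_RAWSZ 32

  struct object_id {
          unsigned char hash[GIT_MAX_RAWSZ];
  };

  /*
   * A SHA-1 tree entry provides only 20 raw hash bytes, but this cast
   * yields an object_id that claims to own GIT_MAX_RAWSZ bytes, so any
   * full-width copy of it reads 12 bytes past the end of the entry.
   */
  static const struct object_id *oid_in_buffer(const unsigned char *hash)
  {
          return (const struct object_id *)hash;
  }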

This was not expected to be a problem until future code was introduced,
but the issue became noticeable due to the interaction of several
in-flight series.

This series introduces a refactor that avoids referencing a struct
object_id directly inside a buffer; instead, it stores an additional
struct object_id (and an int) in struct name_entry and refers to that.
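
Schematically, the change to struct name_entry looks like this (layout
simplified; the real change is in patch 4):

  struct name_entry {
          struct object_id oid;   /* previously: const struct object_id
                                   * *oid, pointing into the tree buffer */
          const char *path;
          unsigned int mode;
  };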

This series, while based on master, addresses the interactions seen on
pu between the SHA-256 series and the oidset series. There are a small
number of conflicts, both textual and logical, when merging this series
and pu, but they should be easily resolved.

This series contains a final patch that will become necessary at some
point for code hygiene, but which could be deferred until later if
desired.
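
Concretely, that final patch changes oidcpy() to copy the full
GIT_MAX_RAWSZ buffer instead of only the bytes the current hash
algorithm uses, along these lines (a sketch, not the verbatim patch):

  #include <string.h>

  static inline void oidcpy(struct object_id *dst, const struct object_id *src)
  {
          /*
           * Copying all GIT_MAX_RAWSZ bytes, padding included, means a
           * copied object_id never carries uninitialized trailing bytes.
           */
          memcpy(dst->hash, src->hash, GIT_MAX_RAWSZ);
  }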

The testsuite passes with AddressSanitizer at each stage and when merged
into pu.
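
For example, one way to run such a check (any ASan-enabled build should
do):

  make SANITIZE=address
  make SANITIZE=address test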

brian m. carlson (5):
  tree-walk: copy object ID before use
  match-trees: compute buffer offset correctly when splicing
  match-trees: use hashcpy to splice trees
  tree-walk: store object_id in a separate member
  cache: make oidcpy always copy GIT_MAX_RAWSZ bytes

 builtin/grep.c                     |  8 ++++----
 builtin/merge-tree.c               | 20 ++++++++++----------
 builtin/pack-objects.c             |  4 ++--
 builtin/reflog.c                   |  4 ++--
 cache-tree.c                       |  4 ++--
 cache.h                            |  2 +-
 contrib/coccinelle/object_id.cocci | 30 ------------------------------
 delta-islands.c                    |  2 +-
 fsck.c                             |  4 ++--
 http-push.c                        |  4 ++--
 list-objects.c                     |  6 +++---
 match-trees.c                      | 11 ++++++-----
 notes.c                            |  4 ++--
 packfile.c                         |  2 +-
 revision.c                         |  4 ++--
 tree-diff.c                        |  6 +++---
 tree-walk.c                        | 21 ++++++++++++---------
 tree-walk.h                        |  9 ++++++---
 tree.c                             | 10 +++++-----
 unpack-trees.c                     |  6 +++---
 walker.c                           |  4 ++--
 21 files changed, 71 insertions(+), 94 deletions(-)

Comments

Jeff King Jan. 10, 2019, 6:40 a.m. UTC | #1
On Thu, Jan 10, 2019 at 04:25:46AM +0000, brian m. carlson wrote:

> There are a small number of places in our codebase where we cast a
> buffer of unsigned char to a struct object_id pointer. When we have
> GIT_MAX_RAWSZ set to 32 (because we have SHA-256), one of these places
> (the buffer for tree objects) can lead to us copying too much data when
> using SHA-1 as the hash, since there are only 20 bytes to read.
> 
> This was not expected to be a problem until future code was introduced,
> but the issue became noticeable due to the interaction of several
> in-flight series.
> 
> This series introduces a refactor that avoids referencing a struct
> object_id directly inside a buffer; instead, it stores an additional
> struct object_id (and an int) in struct name_entry and refers to that.

I think this is really the only safe and sane solution. We resisted it
because of the cost of the extra copies (especially the
update_tree_entry() one). But I don't know that anybody actually
measured it. Do you have any performance numbers before/after this
series?

-Peff
brian m. carlson Jan. 11, 2019, 12:17 a.m. UTC | #2
On Thu, Jan 10, 2019 at 01:40:31AM -0500, Jeff King wrote:
> On Thu, Jan 10, 2019 at 04:25:46AM +0000, brian m. carlson wrote:
> 
> > There are a small number of places in our codebase where we cast a
> > buffer of unsigned char to a struct object_id pointer. When we have
> > GIT_MAX_RAWSZ set to 32 (because we have SHA-256), one of these places
> > (the buffer for tree objects) can lead to us copying too much data when
> > using SHA-1 as the hash, since there are only 20 bytes to read.
> > 
> > This was not expected to be a problem until future code was introduced,
> > but the issue became noticeable due to the interaction of several
> > in-flight series.
> > 
> > This series introduces a refactor that avoids referencing a struct
> > object_id directly inside a buffer; instead, it stores an additional
> > struct object_id (and an int) in struct name_entry and refers to that.
> 
> I think this is really the only safe and sane solution. We resisted it
> because of the cost of the extra copies (especially the
> update_tree_entry() one). But I don't know that anybody actually
> measured it. Do you have any performance numbers before/after this
> series?

Unfortunately, I don't. I'm not really sure in what situations we hit
this code path a lot, so I'm not sure what exactly we should performance
test. If you have suggestions, I can set up some perf tests.

I will say that I resisted writing this series for a long time, since I
knew it was going to be a difficult slog to get everything working (and
it was). If there had been a way to avoid it, I would have done it.
However, writing this series led to about ten tests in my SHA-256 work
suddenly passing where they hadn't before.
Jeff King Jan. 11, 2019, 2:17 p.m. UTC | #3
On Fri, Jan 11, 2019 at 12:17:50AM +0000, brian m. carlson wrote:

> > I think this is really the only safe and sane solution. We resisted it
> > because of the cost of the extra copies (especially the
> > update_tree_entry() one). But I don't know that anybody actually
> > measured it. Do you have any performance numbers before/after this
> > series?
> 
> Unfortunately, I don't. I'm not really sure in what situations we hit
> this code path a lot, so I'm not sure what exactly we should performance
> test. If you have suggestions, I can set up some perf tests.

I think the interesting one here is the copy in update_tree_entry(),
which is called for every entry of every tree we parse for a diff. So
the command with the most noticeable impact, I think, would be something
like:

  git log -- pathspec
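
The per-entry cost being measured is roughly one small copy as
update_tree_entry() decodes each entry, along these lines (simplified;
the helper is illustrative, not the patch itself):

  /*
   * Copy the raw hash bytes out of the tree buffer into storage that
   * the name_entry owns; rawsz is 20 for SHA-1, 32 for SHA-256.
   */
  static void copy_entry_hash(struct object_id *dst,
                              const unsigned char *buf_hash, size_t rawsz)
  {
          memcpy(dst->hash, buf_hash, rawsz);
  }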

I couldn't demonstrate any measurable slowdown (I didn't compute the
mean to see if it gets worse, but certainly any change is within the
run-to-run noise).

I guess that is competing with the cost to access the commit objects.
Perhaps a pure tree diff would be a more precise test. Here's the most
pathological case I could come up with:

  git init
  for i in $(seq 10000); do echo $i >$i; done
  git add .
  git commit -m base
  echo change >9999
  git commit -am change
  time git diff-tree HEAD^ HEAD

which should really be spending 99% of its time inflating and walking
through the trees. And even there I couldn't measure anything.

So I'd say it's probably fine.

-Peff