mbox series

[0/6] getenv() timing fixes

Message ID 20190111221414.GA31335@sigill.intra.peff.net (mailing list archive)
Headers show
Series getenv() timing fixes | expand

Message

Jeff King Jan. 11, 2019, 10:14 p.m. UTC
Similar to the recent:

  https://public-inbox.org/git/20190109221007.21624-1-kgybels@infogroep.be/

there are some other places where we do not follow the POSIX rule that
getenv()'s return value may be invalidated by other calls to getenv() or
setenv().

For the most part we haven't noticed because:

  - on many platforms, you can call getenv() as many times as you want.
    This changed recently in our mingw_getenv() helper, which is why
    people are noticing now.

  - calling setenv() in between _often_ works, but it depends on whether
    libc feels like it needs to reallocate memory. Which is itself
    platform specific, and even on a single platform may depend on
    things like how many environment variables you have set.

The first patch here is a problem somebody actually found in the wild.
That led me to start looking through the results of:

  git grep '= getenv('

There are a ton of hits. I poked at the first 20 or so. A lot of them
are fine, as they do something like this:

  rla = getenv("GIT_REFLOG_ACTION");
  strbuf_addstr("blah blah %s", rla);

That's not _strictly_ correct, because strbuf_addstr() may actually look
at the environment. But it works for our mingw_getenv() case, because
there we use a rotating series of buffers. So as long as it doesn't look at
30 environment variables, we're fine. And many calls fall into that
bucket (a more complicated one is get_ssh_command(), which runs a fair
bit of code while holding the pointer, but ultimately probably has a
small fixed number of opportunities to call getenv(). What is more
worrisome is code that holds a pointer across an arbitrary number of
calls (like once per diff'd file, or once per submodule, etc).

Of course it's possible for some platform libc to use a single buffer.
But in that case, I'd argue that the path of least resistance is
wrapping getenv, like I described in:

  https://public-inbox.org/git/20181025062037.GC11460@sigill.intra.peff.net/

So anyway. Here are a handful of what seem like pretty low-hanging
fruit. Beyond the first one, I'm not sure if they're triggerable, but
they're easy to fix. There are 100+ grep matches that I _didn't_ audit,
so this is by no means a complete fix. I was mostly trying to get a
sense of how painful these fixes would be.

  [1/6]: get_super_prefix(): copy getenv() result
  [2/6]: commit: copy saved getenv() result
  [3/6]: config: make a copy of $GIT_CONFIG string
  [4/6]: init: make a copy of $GIT_DIR string
  [5/6]: merge-recursive: copy $GITHEAD strings
  [6/6]: builtin_diff(): read $GIT_DIFF_OPTS closer to use

 builtin/commit.c          |  3 ++-
 builtin/config.c          |  2 +-
 builtin/init-db.c         |  6 ++++--
 builtin/merge-recursive.c | 15 ++++++++++-----
 diff.c                    |  5 ++++-
 environment.c             |  4 ++--
 6 files changed, 23 insertions(+), 12 deletions(-)

-Peff

Comments

Ævar Arnfjörð Bjarmason Jan. 12, 2019, 11:31 a.m. UTC | #1
On Fri, Jan 11 2019, Jeff King wrote:

> Similar to the recent:
>
>   https://public-inbox.org/git/20190109221007.21624-1-kgybels@infogroep.be/
>
> there are some other places where we do not follow the POSIX rule that
> getenv()'s return value may be invalidated by other calls to getenv() or
> setenv().
>
> For the most part we haven't noticed because:
>
>   - on many platforms, you can call getenv() as many times as you want.
>     This changed recently in our mingw_getenv() helper, which is why
>     people are noticing now.
>
>   - calling setenv() in between _often_ works, but it depends on whether
>     libc feels like it needs to reallocate memory. Which is itself
>     platform specific, and even on a single platform may depend on
>     things like how many environment variables you have set.
>
> The first patch here is a problem somebody actually found in the wild.
> That led me to start looking through the results of:
>
>   git grep '= getenv('
>
> There are a ton of hits. I poked at the first 20 or so. A lot of them
> are fine, as they do something like this:
>
>   rla = getenv("GIT_REFLOG_ACTION");
>   strbuf_addstr("blah blah %s", rla);
>
> That's not _strictly_ correct, because strbuf_addstr() may actually look
> at the environment. But it works for our mingw_getenv() case, because
> there we use a rotating series of buffers. So as long as it doesn't look at
> 30 environment variables, we're fine. And many calls fall into that
> bucket (a more complicated one is get_ssh_command(), which runs a fair
> bit of code while holding the pointer, but ultimately probably has a
> small fixed number of opportunities to call getenv(). What is more
> worrisome is code that holds a pointer across an arbitrary number of
> calls (like once per diff'd file, or once per submodule, etc).
>
> Of course it's possible for some platform libc to use a single buffer.
> But in that case, I'd argue that the path of least resistance is
> wrapping getenv, like I described in:
>
>   https://public-inbox.org/git/20181025062037.GC11460@sigill.intra.peff.net/
>
> So anyway. Here are a handful of what seem like pretty low-hanging
> fruit. Beyond the first one, I'm not sure if they're triggerable, but
> they're easy to fix. There are 100+ grep matches that I _didn't_ audit,
> so this is by no means a complete fix. I was mostly trying to get a
> sense of how painful these fixes would be.

I wonder, and not as "you should do this" feedback on this series, just
on future development, whether we shouldn't just make our own getenv()
wrapper for the majority of the GIT_* variables. The semantics would be
fetch value X, and if it's ever requested again return the value we
found the first time.

For some things we rely on getenv(X) -> setenv(X) -> getenv(X) returning
different values of X, e.g. in passing along config, but for
e.g. GIT_TEST_* variables we just want to check them once, and have our
own ad-hoc caches (via static variables) in a couple of places.

Maybe such an API would just loop over "environ" on startup, looking for
any GIT_* variables, i.e. called from the setup.c code.
Stefan Beller Jan. 12, 2019, 6:51 p.m. UTC | #2
> I wonder, and not as "you should do this" feedback on this series, just

There is a getenv_safe() in environment.c, but I guess a xgetenv() that
takes the same parameters as getenv() is better for ease of use.
Jeff King Jan. 15, 2019, 7:12 p.m. UTC | #3
On Sat, Jan 12, 2019 at 12:31:21PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > So anyway. Here are a handful of what seem like pretty low-hanging
> > fruit. Beyond the first one, I'm not sure if they're triggerable, but
> > they're easy to fix. There are 100+ grep matches that I _didn't_ audit,
> > so this is by no means a complete fix. I was mostly trying to get a
> > sense of how painful these fixes would be.
> 
> I wonder, and not as "you should do this" feedback on this series, just
> on future development, whether we shouldn't just make our own getenv()
> wrapper for the majority of the GIT_* variables. The semantics would be
> fetch value X, and if it's ever requested again return the value we
> found the first time.

Yeah, that thought certainly crossed my mind while looking into this.
But as you noted below, you do sometimes have to worry about
invalidating that cache. The most general solution is that you'd hook
setenv(), too. At which point you've basically just constructed a shadow
environment that has less-crappy semantics than what POSIX guarantees. ;)

Another option is to just have a getenv_safe() that always returns an
allocated string. That's less efficient in some cases, but probably not
meaningfully so (you probably shouldn't be calling getenv() in a tight
loop anyway). It does mean dealing with memory ownership, though, which
is awkward in some cases (e.g., see git_editor).

Mostly I'm worried about making a system that's opaque or easy for
people to get wrong (e.g., if our getenv() wrapper quietly caches things
but setenv() does not invalidate that cache, that's a recipe for
confusion).

> Maybe such an API would just loop over "environ" on startup, looking for
> any GIT_* variables, i.e. called from the setup.c code.

I think whatever we do could just lazy-load. There's no particular
initialization we have to do at the beginning of the program.

-Peff
Jeff King Jan. 15, 2019, 7:13 p.m. UTC | #4
On Sat, Jan 12, 2019 at 10:51:42AM -0800, Stefan Beller wrote:

> > I wonder, and not as "you should do this" feedback on this series, just
> 
> There is a getenv_safe() in environment.c, but I guess a xgetenv() that
> takes the same parameters as getenv() is better for ease of use.

Yes, but it punts on the memory ownership by stuffing everything into an
argv_array. That saves a few lines if you're going to ask for five
variables, but for a single variable it's no better than:

  char *foo = getenv_safe("FOO");

  ...use foo...

  free(foo);

-Peff
Junio C Hamano Jan. 15, 2019, 7:32 p.m. UTC | #5
Jeff King <peff@peff.net> writes:

> On Sat, Jan 12, 2019 at 10:51:42AM -0800, Stefan Beller wrote:
>
>> > I wonder, and not as "you should do this" feedback on this series, just
>> 
>> There is a getenv_safe() in environment.c, but I guess a xgetenv() that
>> takes the same parameters as getenv() is better for ease of use.
>
> Yes, but it punts on the memory ownership by stuffing everything into an
> argv_array. That saves a few lines if you're going to ask for five
> variables, but for a single variable it's no better than:
>
>   char *foo = getenv_safe("FOO");

You meant xstrdup_or_null(getenv("FOO")) here?  And did Stefan mean

	#define xgetenv(e) xstrdup_or_null(getenv(e))

?

>   ...use foo...
>
>   free(foo);
Stefan Beller Jan. 15, 2019, 7:38 p.m. UTC | #6
On Tue, Jan 15, 2019 at 11:32 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> > On Sat, Jan 12, 2019 at 10:51:42AM -0800, Stefan Beller wrote:
> >
> >> > I wonder, and not as "you should do this" feedback on this series, just
> >>
> >> There is a getenv_safe() in environment.c, but I guess a xgetenv() that
> >> takes the same parameters as getenv() is better for ease of use.
> >
> > Yes, but it punts on the memory ownership by stuffing everything into an
> > argv_array. That saves a few lines if you're going to ask for five
> > variables, but for a single variable it's no better than:
> >
> >   char *foo = getenv_safe("FOO");
>
> You meant xstrdup_or_null(getenv("FOO")) here?  And did Stefan mean
>
>         #define xgetenv(e) xstrdup_or_null(getenv(e))
>
> ?

Assume I did. (I thought of it as a function effectively
adding the xstrdup_or_null)

If we go further into assuming the usage patterns of
these xgetenv calls, we might throw in an UNLEAK
as well, but that might be over board.
Jeff King Jan. 15, 2019, 7:41 p.m. UTC | #7
On Tue, Jan 15, 2019 at 11:32:56AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Sat, Jan 12, 2019 at 10:51:42AM -0800, Stefan Beller wrote:
> >
> >> > I wonder, and not as "you should do this" feedback on this series, just
> >> 
> >> There is a getenv_safe() in environment.c, but I guess a xgetenv() that
> >> takes the same parameters as getenv() is better for ease of use.
> >
> > Yes, but it punts on the memory ownership by stuffing everything into an
> > argv_array. That saves a few lines if you're going to ask for five
> > variables, but for a single variable it's no better than:
> >
> >   char *foo = getenv_safe("FOO");
> 
> You meant xstrdup_or_null(getenv("FOO")) here?  And did Stefan mean
> 
> 	#define xgetenv(e) xstrdup_or_null(getenv(e))
> 
> ?

Yes, I think that would be one possible implementation of a "safe"
getenv (and what I was thinking of specifically in that example).

The more involved one (that doesn't pass along memory ownership) is
something like:

  static struct hashmap env_cache;

  const char *getenv_safe(const char *name)
  {

	if (e = hashmap_get(&env_cache, name))
		return e->value;

        /* need some trickery to make sure xstrdup does not call getenv */
	e->value = xstrdup_or_null(getenv(name));
	e->name = xstrdup(name);
	hashmap_put(&env_cache, e);

	return e->value;
  }

with a matching setenv_safe() to drop the hashmap entry. Come to think
of it, this is really pretty equivalent to string-interning, which we
already have a hashmap for. I think one could argue that string
interning is basically just a controlled form of memory leaking, but
it's probably a reasonable compromise in this instance (i.e., we expect
to ask about a finite number of variables anyway; the important thing is
just that we don't leak memory for the same variable over and over).

-Peff
Jeff King Jan. 15, 2019, 7:47 p.m. UTC | #8
On Tue, Jan 15, 2019 at 02:41:42PM -0500, Jeff King wrote:

> The more involved one (that doesn't pass along memory ownership) is
> something like:
> 
>   static struct hashmap env_cache;
> 
>   const char *getenv_safe(const char *name)
>   {
> 
> 	if (e = hashmap_get(&env_cache, name))
> 		return e->value;
> 
>         /* need some trickery to make sure xstrdup does not call getenv */
> 	e->value = xstrdup_or_null(getenv(name));
> 	e->name = xstrdup(name);
> 	hashmap_put(&env_cache, e);
> 
> 	return e->value;
>   }
> 
> with a matching setenv_safe() to drop the hashmap entry. Come to think
> of it, this is really pretty equivalent to string-interning, which we
> already have a hashmap for. I think one could argue that string
> interning is basically just a controlled form of memory leaking, but
> it's probably a reasonable compromise in this instance (i.e., we expect
> to ask about a finite number of variables anyway; the important thing is
> just that we don't leak memory for the same variable over and over).

So actually, that's pretty easy to do without writing much code at all.
Something like:

  #define xgetenv(name) strintern(getenv(name))

It means we're effectively storing the environment twice in the worst
case, but that's probably not a big deal. Unless we have a loop which
does repeated setenv()/getenv() calls, the strintern hashmap can't grow
without bound.

-Peff
Junio C Hamano Jan. 15, 2019, 8:49 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

> So actually, that's pretty easy to do without writing much code at all.
> Something like:
>
>   #define xgetenv(name) strintern(getenv(name))
>
> It means we're effectively storing the environment twice in the worst
> case, but that's probably not a big deal. Unless we have a loop which
> does repeated setenv()/getenv() calls, the strintern hashmap can't grow
> without bound.

Makes sense.



 hashmap.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hashmap.h b/hashmap.h
index d375d9cce7..cff77f9890 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -432,6 +432,8 @@ static inline void hashmap_enable_item_counting(struct hashmap *map)
 extern const void *memintern(const void *data, size_t len);
 static inline const char *strintern(const char *string)
 {
+	if (!string)
+		return string;
 	return memintern(string, strlen(string));
 }