mbox series

[0/6,RFC] Lazy-loaded default Git config

Message ID pull.1539.git.1685716420.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Lazy-loaded default Git config | expand

Message

Philippe Blain via GitGitGadget June 2, 2023, 2:33 p.m. UTC
While working on the stronger protections around replace refs config [1], I
wanted to think about a way to scale that concept of ensuring that we've
read the config before accessing the global to a wider scope of config keys.

[1]
https://lore.kernel.org/git/pull.1537.v2.git.1685716157.gitgitgadget@gmail.com/

If you wish to apply these patches, then do so on top of v2 of [1].

This RFC attempt to explore this space, to get a feeling whether this is
solving a worthwhile problem and what this system should look like.

One example option is presented here, but I'm not super-confident in it and
only implemented enough translations such that I could gather enough
examples to know this would work at least a little bit.

The basic idea is to create a new 'global-config' library that has enums for
different config keys, and from each enum value we can load the config
string and the stored global value, as well as access an indicator that the
value has been loaded from config yet.

In this way, we can guard accesses to the global state through a method
which will lazy-load the config value as needed. This avoids the kind of bug
where a builtin forgets to initialize config through a git_config() call at
the right time.

One of the issues that becomes particularly important is that some of our
globals are accessed before config is available. Specifically, in patch 6,
we replace the ignore_case global with the method accessor. This is used in
fspathcmp() to toggle between strcasecmp() and strcmp(). However, this
method is called as part of the process to set up the Git directory, which
must happen before we can start loading config. The current behavior is to
always use the default of 0 at this time.

To get around this, I introduce a declare_config_available() method which is
called at the right time in git.c and test-tool.c. Before this is called, we
will silently pass the default value instead of loading from config. Without
this behavior, our test suite would fail in a BUG() statement due to the
inconsistent use of the config library.

My hope with submitting this RFC is that we could come to some conclusions
on these questions:

 * Is this a worthwhile category of bug to protect against?
 * Is wrapping global state in accessor methods the right way to protect
   against that class of bugs?
 * Are there other benefits to doing this?
 * Are there reasons why this cure is worse than the disease?

There are a few things that I personally find could use improvement and I
would like to update before submitting this for full review. (I chose to not
spend extra time re-working this until there is agreement on the general
direction.) Here are the things I've identified so far:

 1. The (static) global state in global-config.c is split across two arrays
    and a bitmask. It would be better to have a single array of structs so
    we can have the default values described right next to the config key
    strings.
 2. The current accessor method is limited to returning ints, and is named
    as such. But many of the keys it replace are actually boolean-valued.
    The current implementation tries ...get_maybe_bool() before then trying
    ...get_int(), which would extend values like core.filemode to allow
    taking the value "3". This could be fixed by using the struct approach
    and signalling that some int-typed globals only allow boolean values.
    (This holds the same for int-valued things that shouldn't take "true".)
 3. The method is currently scoped globally, and does not allow for
    repository-scoped values. None of these globals are currently scoped to
    repositories, but it would be less intrusive to do repository-scoping
    later if the accessor method allowed repository scoping and then looked
    at the key-specific settings as to whether we should look globally or in
    a repository.
 4. The replacement of a constant like 'ignore_case' with a method like
    'get_int_config_global(INT_CONFIG_IGNORE_CASE)' is quite verbose and
    ugly. I specifically chose this approach instead of dedicated methods
    like 'replace_refs_enabled()' in [1] because I didn't think that
    approach would scale. However, we could perhaps use macros as a way to
    reduce the verbosity, especially when deciding to add repository-scoping
    as necessary. For instance, we could define 'get_ignore_case()' to be
    'get_int_config_global(the_repository, INT_CONFIG_IGNORE_CASE)' in
    global-config.h to avoid spreading this verbosity across the codebase.
    Alternatively, we could define these as inline methods if that is
    preferred.

Thank you for your help determining the right way forward here.

Thanks, -Stolee

Derrick Stolee (6):
  config: create new global config helpers
  config: add trust_executable_bit to global config
  config: move trust_ctime to global config
  config: move quote_path_fully to global config
  config: move has_symlinks to global config
  config: move ignore_case to global config

 Makefile                                |  1 +
 apply.c                                 |  6 ++-
 builtin/difftool.c                      |  4 +-
 builtin/mv.c                            |  2 +-
 cache.h                                 |  9 ++++-
 combine-diff.c                          |  2 +-
 config.c                                | 24 -----------
 dir.c                                   | 23 ++++++-----
 entry.c                                 |  3 +-
 environment.c                           |  4 --
 environment.h                           |  4 --
 git.c                                   |  4 ++
 global-config.c                         | 54 +++++++++++++++++++++++++
 global-config.h                         | 30 ++++++++++++++
 merge-recursive.c                       | 11 ++---
 name-hash.c                             |  6 +--
 quote.c                                 |  6 +--
 quote.h                                 |  2 -
 read-cache.c                            | 22 +++++-----
 t/helper/test-lazy-init-name-hash.c     |  5 ---
 t/helper/test-tool.c                    |  3 ++
 t/perf/p0004-lazy-init-name-hash.sh     |  6 +++
 t/t3008-ls-files-lazy-init-name-hash.sh |  1 +
 unpack-trees.c                          |  2 +-
 24 files changed, 156 insertions(+), 78 deletions(-)
 create mode 100644 global-config.c
 create mode 100644 global-config.h


base-commit: 4c7dbeb8c6dd6ab4c1903196967d5e0906a293c2
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1539%2Fderrickstolee%2Fdefault-config-safety-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1539/derrickstolee/default-config-safety-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1539

Comments

Glen Choo June 8, 2023, 6:19 p.m. UTC | #1
Hi Stolee!

Thanks for coming to the Libification discussion today, it was nice to
be able to discuss this idea with a bigger group.

As is custom, I'll repost my own thoughts here. If folks are interested
in a summary of the whole discussion, you can find the notes at

  https://docs.google.com/document/d/1mw_qPPgfQUv1gfKMwmvZjpYaOOzxcNH2N8bRbvBWfIQ/edit#heading=h.fcm7uhwlk55z

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  * Is this a worthwhile category of bug to protect against?

If we look past the concrete bugs and consider the general problem of
process-wide state being hard to use correctly, I think this is
definitely worth solving.

And in case anyone doubts the the current interface is hard to use
correctly: the state lives in poorly scoped global variables that need
to be manually initialized by calling git_config() at the right time
with git_default_config. It's also hard to remember to do this for your
builtin because some builtins call git_default_config in _their own_
outer config_fn_t, and others call git_config(git_default_config).

>  * Is wrapping global state in accessor methods the right way to protect
>    against that class of bugs?

Lazy-loading via accessor methods seem slightly excessive. The crux of
the problem is that we haven't initialized the values before they are
read, so I think we can get away with plain fields in a struct as long
as they are always initialized (e.g. the builtin author doesn't need to
remember to do this). We could initialize all of the fields during the
setup process, where you placed declare_config_available(), at which
point config should be available.

If we use config hash lookups to intialize the values we want,
pre-initializing like this shouldn't be noticeably more wasteful
compared to lazy-loading, since in either case, we are already parsing
all of the config into the in-memory config sets and looking them up
with hash lookups. Pre-initializing does a small bit more work upfront
by assigning to the fields, but it means that accessing the settings
later can be faster since we can avoid the "getter method" calls.

>  * Are there other benefits to doing this?

Yes! I'm generally excited about encapsulating loose global variables
into an appropriate abstraction, e.g. it was something I was thinking
about this when I was working on protected config (which is loose global
state, but also sort of a collection of repo-wide settings). It'll be
extremely relevant for libification, since this paves the way to
eventually encapsulate the process-wide state in a way that makes it
separable from the library code.

>  * Are there reasons why this cure is worse than the disease?

This seems quite similar to what we're doing with repo-settings.c (which
also has its own problems with initialization), and I'd like to avoid
having two similar-looking-but-different systems in the long run. For
this to go forward, I'd like to see either an explicit goal to deprecate
repo-settings.c (e.g. this new system handles repository settings in
addition to process-wide settings), or a reasoned separation between
them.