mbox series

[00/11] Clarify API for dir.[ch] and unpack-trees.[ch] -- mark relevant fields as internal

Message ID pull.1149.git.1677143700.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Clarify API for dir.[ch] and unpack-trees.[ch] -- mark relevant fields as internal | expand

Message

Derrick Stolee via GitGitGadget Feb. 23, 2023, 9:14 a.m. UTC
I wrote this patch series about a year and a half ago, but never submitted
it. I rebased and updated it due to [0].

Some time ago, I noticed that struct dir_struct and struct
unpack_trees_options both have numerous fields meant for internal use only,
most of which are not marked as such. This has resulted in callers
accidentally trying to initialize some of these fields, and in at least one
case required a fair amount of review to verify other changes were okay --
review that would have been simplified with the apriori knowledge that a
combination of multiple fields were internal-only[1]. Looking closer, I
found that only 6 out of 18 fields in dir_struct were actually meant to be
public[2], and noted that unpack_trees_options also had 11 internal-only
fields (out of 36).

This patch is primarily about moving internal-only fields within these two
structs into an embedded internal struct. Patch breakdown:

 * Patches 1-3: Restructuring dir_struct
   * Patch 1: Splitting off internal-use-only fields
   * Patch 2: Add important usage note to avoid accidentally using
     deprecated API
   * Patch 3: Mark output-only fields as such
 * Patches 4-11: Restructuring unpack_trees_options
   * Patches 4-6: Preparatory cleanup
   * Patches 7-10: Splitting off internal-use-only fields
   * Patch 11: Mark output-only field as such

To make the benefit more clear, here are compressed versions of dir_struct
both before and after the changes. First, before:

struct dir_struct {
    int nr;
    int alloc;
    int ignored_nr;
    int ignored_alloc;
    enum [...] flags;
    struct dir_entry **entries;
    struct dir_entry **ignored;
    const char *exclude_per_dir;
#define EXC_CMDL 0
#define EXC_DIRS 1
#define EXC_FILE 2
    struct exclude_list_group exclude_list_group[3];
    struct exclude_stack *exclude_stack;
    struct path_pattern *pattern;
    struct strbuf basebuf;
    struct untracked_cache *untracked;
    struct oid_stat ss_info_exclude;
    struct oid_stat ss_excludes_file;
    unsigned unmanaged_exclude_files;
    unsigned visited_paths;
    unsigned visited_directories;
};


And after the changes:

struct dir_struct {
    enum [...] flags;
    int nr; /* output only */
    int ignored_nr; /* output only */
    struct dir_entry **entries; /* output only */
    struct dir_entry **ignored; /* output only */
    struct untracked_cache *untracked;
    const char *exclude_per_dir; /* deprecated */
    struct dir_struct_internal {
        int alloc;
        int ignored_alloc;
#define EXC_CMDL 0
#define EXC_DIRS 1
#define EXC_FILE 2
        struct exclude_list_group exclude_list_group[3];
        struct exclude_stack *exclude_stack;
        struct path_pattern *pattern;
        struct strbuf basebuf;
        struct oid_stat ss_info_exclude;
        struct oid_stat ss_excludes_file;
        unsigned unmanaged_exclude_files;
        unsigned visited_paths;
        unsigned visited_directories;
    } internal;
};


The former version has 18 fields (and 3 magic constants) which API users
will have to figure out. The latter makes it clear there are only at most 2
fields you should be setting upon input, and at most 4 which you read at
output, and the rest (including all the magic constants) you can ignore.

[0] Search for "Extremely yes" in
https://lore.kernel.org/git/CAJoAoZm+TkCL0Jpg_qFgKottxbtiG2QOiY0qGrz3-uQy+=waPg@mail.gmail.com/
[1]
https://lore.kernel.org/git/CABPp-BFSFN3WM6q7KzkD5mhrwsz--St_-ej5LbaY8Yr2sZzj=w@mail.gmail.com/
[2]
https://lore.kernel.org/git/CABPp-BHgot=CPNyK_xNfog_SqsNPNoCGfiSb-gZoS2sn_741dQ@mail.gmail.com/

Elijah Newren (11):
  dir: separate public from internal portion of dir_struct
  dir: add a usage note to exclude_per_dir
  dir: mark output only fields of dir_struct as such
  unpack-trees: clean up some flow control
  sparse-checkout: avoid using internal API of unpack-trees
  sparse-checkout: avoid using internal API of unpack-trees, take 2
  unpack_trees: start splitting internal fields from public API
  unpack-trees: mark fields only used internally as internal
  unpack-trees: rewrap a few overlong lines from previous patch
  unpack-trees: special case read-tree debugging as internal usage
  unpack-trees: add usage notices around df_conflict_entry

 builtin/read-tree.c       |  10 +-
 builtin/sparse-checkout.c |   4 +-
 dir.c                     | 114 +++++++++---------
 dir.h                     | 110 +++++++++--------
 unpack-trees.c            | 245 ++++++++++++++++++++------------------
 unpack-trees.h            |  42 ++++---
 6 files changed, 275 insertions(+), 250 deletions(-)


base-commit: 06dd2baa8da4a73421b959ec026a43711b9d77f9
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1149%2Fnewren%2Fclarify-api-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1149/newren/clarify-api-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1149

Comments

Derrick Stolee Feb. 23, 2023, 3:18 p.m. UTC | #1
On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote:
> This patch is primarily about moving internal-only fields within these two
> structs into an embedded internal struct. Patch breakdown:
> 
>  * Patches 1-3: Restructuring dir_struct
>    * Patch 1: Splitting off internal-use-only fields
>    * Patch 2: Add important usage note to avoid accidentally using
>      deprecated API
>    * Patch 3: Mark output-only fields as such
>  * Patches 4-11: Restructuring unpack_trees_options
>    * Patches 4-6: Preparatory cleanup
>    * Patches 7-10: Splitting off internal-use-only fields
>    * Patch 11: Mark output-only field as such
 
> And after the changes:
> 
> struct dir_struct {
>     enum [...] flags;
>     int nr; /* output only */
>     int ignored_nr; /* output only */
>     struct dir_entry **entries; /* output only */
>     struct dir_entry **ignored; /* output only */
>     struct untracked_cache *untracked;
>     const char *exclude_per_dir; /* deprecated */
>     struct dir_struct_internal {
>         int alloc;
>         int ignored_alloc;
> #define EXC_CMDL 0
> #define EXC_DIRS 1
> #define EXC_FILE 2
>         struct exclude_list_group exclude_list_group[3];
>         struct exclude_stack *exclude_stack;
>         struct path_pattern *pattern;
>         struct strbuf basebuf;
>         struct oid_stat ss_info_exclude;
>         struct oid_stat ss_excludes_file;
>         unsigned unmanaged_exclude_files;
>         unsigned visited_paths;
>         unsigned visited_directories;
>     } internal;
> };

This does present a very clear structure to avoid callers being
confused when writing these changes. It doesn't, however, present
any way to guarantee that callers can't mutate this state.

...here I go on a side track thinking of an alternative...

One way to track this would be to anonymously declare 'struct
dir_struct_internal' in the header file and let 'struct dir_struct'
contain a _pointer_ to the internal struct. The dir_struct_internal
can then be defined inside the .c file, limiting its scope. (It
must be a pointer in dir_struct or else callers would not be able
to create a dir_struct without using a pointer and an initializer
method.

The major downside to this pointer approach is that the internal
struct needs to be initialized within API calls and somehow cleared
by all callers. The internal data could be initialized by the common
initializers read_directory() or fill_directory(). There is a
dir_clear() that _should_ be called by all callers (but I notice we
are leaking the struct in at least one place in add-interactive.c,
and likely others).

This alternative adds some complexity to the structure, but
provides compiler-level guarantees that these internals are not used
outside of dir.c. I thought it worth exploring, even if we decide
that the complexity is not worth those guarantees.

The best news is that your existing series makes it easier to flip
to the internal pointer method in the future, since we can shift
the 'd->internal.member" uses into "d->internal->member" in a
mechanical way. Thus, the change you are proposing does not lock us
into this approach if we change our minds later.

Thanks,
-Stolee
Derrick Stolee Feb. 23, 2023, 3:26 p.m. UTC | #2
On 2/23/2023 10:18 AM, Derrick Stolee wrote:
> On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote:
>> This patch is primarily about moving internal-only fields within these two
>> structs into an embedded internal struct. Patch breakdown:
>>
>>  * Patches 1-3: Restructuring dir_struct
>>    * Patch 1: Splitting off internal-use-only fields
>>    * Patch 2: Add important usage note to avoid accidentally using
>>      deprecated API
>>    * Patch 3: Mark output-only fields as such
>>  * Patches 4-11: Restructuring unpack_trees_options
>>    * Patches 4-6: Preparatory cleanup
>>    * Patches 7-10: Splitting off internal-use-only fields
>>    * Patch 11: Mark output-only field as such
...
> The best news is that your existing series makes it easier to flip
> to the internal pointer method in the future, since we can shift
> the 'd->internal.member" uses into "d->internal->member" in a
> mechanical way. Thus, the change you are proposing does not lock us
> into this approach if we change our minds later.

And now that I've read the series in its entirety, I think it is
well organized and does not need any updates. It creates a better
situation than what we already have, and any changes to split the
internal structs to be anonymous to callers can be done as a
follow-up.

Thanks,
-Stolee
Elijah Newren Feb. 23, 2023, 8:31 p.m. UTC | #3
On Thu, Feb 23, 2023 at 7:18 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote:
> > This patch is primarily about moving internal-only fields within these two
> > structs into an embedded internal struct. Patch breakdown:
> >
> >  * Patches 1-3: Restructuring dir_struct
> >    * Patch 1: Splitting off internal-use-only fields
> >    * Patch 2: Add important usage note to avoid accidentally using
> >      deprecated API
> >    * Patch 3: Mark output-only fields as such
> >  * Patches 4-11: Restructuring unpack_trees_options
> >    * Patches 4-6: Preparatory cleanup
> >    * Patches 7-10: Splitting off internal-use-only fields
> >    * Patch 11: Mark output-only field as such
>
> > And after the changes:
> >
> > struct dir_struct {
> >     enum [...] flags;
> >     int nr; /* output only */
> >     int ignored_nr; /* output only */
> >     struct dir_entry **entries; /* output only */
> >     struct dir_entry **ignored; /* output only */
> >     struct untracked_cache *untracked;
> >     const char *exclude_per_dir; /* deprecated */
> >     struct dir_struct_internal {
> >         int alloc;
> >         int ignored_alloc;
> > #define EXC_CMDL 0
> > #define EXC_DIRS 1
> > #define EXC_FILE 2
> >         struct exclude_list_group exclude_list_group[3];
> >         struct exclude_stack *exclude_stack;
> >         struct path_pattern *pattern;
> >         struct strbuf basebuf;
> >         struct oid_stat ss_info_exclude;
> >         struct oid_stat ss_excludes_file;
> >         unsigned unmanaged_exclude_files;
> >         unsigned visited_paths;
> >         unsigned visited_directories;
> >     } internal;
> > };
>
> This does present a very clear structure to avoid callers being
> confused when writing these changes. It doesn't, however, present
> any way to guarantee that callers can't mutate this state.
>
> ...here I go on a side track thinking of an alternative...
>
> One way to track this would be to anonymously declare 'struct
> dir_struct_internal' in the header file and let 'struct dir_struct'
> contain a _pointer_ to the internal struct. The dir_struct_internal
> can then be defined inside the .c file, limiting its scope. (It
> must be a pointer in dir_struct or else callers would not be able
> to create a dir_struct without using a pointer and an initializer
> method.
>
> The major downside to this pointer approach is that the internal
> struct needs to be initialized within API calls and somehow cleared
> by all callers. The internal data could be initialized by the common
> initializers read_directory() or fill_directory(). There is a
> dir_clear() that _should_ be called by all callers (but I notice we
> are leaking the struct in at least one place in add-interactive.c,
> and likely others).
>
> This alternative adds some complexity to the structure, but
> provides compiler-level guarantees that these internals are not used
> outside of dir.c. I thought it worth exploring, even if we decide
> that the complexity is not worth those guarantees.

In addition to the guarantees that others won't muck with those
fields, such an approach would also buy us the following:

  * The implementation can change and internal data structures be
modified/extended/appended-to, without requiring recompiling all files
that depend upon our header.
  * Related to the last point, the ABI doesn't change when the
implementation and internal data structures change.  Might be helpful
for libification efforts.

For all three of these reasons, I pursued this alternate strategy you
bring up in merge-recursive and merge-ort (they both use a "struct
merge_options_internal *priv" and then define _very_ different "struct
merge_options_iternal" in their respective C files).  I thought about
using this strategy you suggest here, but was worried at the time I
created this patch that it might create more friction for Ævar who was
pushing his struct initialization work and memory leak cleanups in the
same area heavily at the time (and back then we had a few long threads
back and forth because our work was overlapping and clashing
slightly).  But I figured that doing at least this much was good,
because of what you point out:

> The best news is that your existing series makes it easier to flip
> to the internal pointer method in the future, since we can shift
> the 'd->internal.member" uses into "d->internal->member" in a
> mechanical way. Thus, the change you are proposing does not lock us
> into this approach if we change our minds later.
Elijah Newren Feb. 23, 2023, 8:35 p.m. UTC | #4
On Thu, Feb 23, 2023 at 7:26 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 2/23/2023 10:18 AM, Derrick Stolee wrote:
> > On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote:
> >> This patch is primarily about moving internal-only fields within these two
> >> structs into an embedded internal struct. Patch breakdown:
> >>
> >>  * Patches 1-3: Restructuring dir_struct
> >>    * Patch 1: Splitting off internal-use-only fields
> >>    * Patch 2: Add important usage note to avoid accidentally using
> >>      deprecated API
> >>    * Patch 3: Mark output-only fields as such
> >>  * Patches 4-11: Restructuring unpack_trees_options
> >>    * Patches 4-6: Preparatory cleanup
> >>    * Patches 7-10: Splitting off internal-use-only fields
> >>    * Patch 11: Mark output-only field as such
> ...
> > The best news is that your existing series makes it easier to flip
> > to the internal pointer method in the future, since we can shift
> > the 'd->internal.member" uses into "d->internal->member" in a
> > mechanical way. Thus, the change you are proposing does not lock us
> > into this approach if we change our minds later.
>
> And now that I've read the series in its entirety, I think it is
> well organized and does not need any updates. It creates a better
> situation than what we already have, and any changes to split the
> internal structs to be anonymous to callers can be done as a
> follow-up.

Wow, I was a bit worried pushing a couple dozen patches last night
that it'd be weeks before anyone took a look, and perhaps even that
I'd again get comments that I was pushing too many to the list.  You
read and reviewed all of them across both series, including some
comments showing you read pretty carefully, all before I had even
woken up.  Very cool; thanks.
Junio C Hamano Feb. 24, 2023, 1:24 a.m. UTC | #5
Derrick Stolee <derrickstolee@github.com> writes:

> The major downside to this pointer approach is that the internal
> struct needs to be initialized within API calls and somehow cleared
> by all callers. The internal data could be initialized by the common
> initializers read_directory() or fill_directory(). There is a
> dir_clear() that _should_ be called by all callers (but I notice we
> are leaking the struct in at least one place in add-interactive.c,
> and likely others).
>
> This alternative adds some complexity to the structure, but
> provides compiler-level guarantees that these internals are not used
> outside of dir.c. I thought it worth exploring, even if we decide
> that the complexity is not worth those guarantees.

I actually think the current structure may be a good place to stop
at.  Or we could use the original flat structure, but with members
that are supposed to be private prefixed with a longer prefix that
is very specific to the dir.c file, say "private_to_dir_c_".

Then have a block of #define

	#define alloc private_to_dir_c_alloc
	#define ignored_alloc private_to_dir_c_ignored_alloc
	...
	#define visited_directories private_to_dir_c_visited_directories

at the beginning dir.c to hide the cruft out of the implementation.
"git grep private_to_dir_c ':!dir.c'" would catch any outsider
peeking into the part of the struct that they shouldn't be touching.
Jacob Keller Feb. 24, 2023, 5:54 a.m. UTC | #6
On Thu, Feb 23, 2023 at 7:29 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote:
> > This patch is primarily about moving internal-only fields within these two
> > structs into an embedded internal struct. Patch breakdown:
> >
> >  * Patches 1-3: Restructuring dir_struct
> >    * Patch 1: Splitting off internal-use-only fields
> >    * Patch 2: Add important usage note to avoid accidentally using
> >      deprecated API
> >    * Patch 3: Mark output-only fields as such
> >  * Patches 4-11: Restructuring unpack_trees_options
> >    * Patches 4-6: Preparatory cleanup
> >    * Patches 7-10: Splitting off internal-use-only fields
> >    * Patch 11: Mark output-only field as such
>
> > And after the changes:
> >
> > struct dir_struct {
> >     enum [...] flags;
> >     int nr; /* output only */
> >     int ignored_nr; /* output only */
> >     struct dir_entry **entries; /* output only */
> >     struct dir_entry **ignored; /* output only */
> >     struct untracked_cache *untracked;
> >     const char *exclude_per_dir; /* deprecated */
> >     struct dir_struct_internal {
> >         int alloc;
> >         int ignored_alloc;
> > #define EXC_CMDL 0
> > #define EXC_DIRS 1
> > #define EXC_FILE 2
> >         struct exclude_list_group exclude_list_group[3];
> >         struct exclude_stack *exclude_stack;
> >         struct path_pattern *pattern;
> >         struct strbuf basebuf;
> >         struct oid_stat ss_info_exclude;
> >         struct oid_stat ss_excludes_file;
> >         unsigned unmanaged_exclude_files;
> >         unsigned visited_paths;
> >         unsigned visited_directories;
> >     } internal;
> > };
>
> This does present a very clear structure to avoid callers being
> confused when writing these changes. It doesn't, however, present
> any way to guarantee that callers can't mutate this state.
>
> ...here I go on a side track thinking of an alternative...
>
> One way to track this would be to anonymously declare 'struct
> dir_struct_internal' in the header file and let 'struct dir_struct'
> contain a _pointer_ to the internal struct. The dir_struct_internal
> can then be defined inside the .c file, limiting its scope. (It
> must be a pointer in dir_struct or else callers would not be able
> to create a dir_struct without using a pointer and an initializer
> method.
>
> The major downside to this pointer approach is that the internal
> struct needs to be initialized within API calls and somehow cleared
> by all callers. The internal data could be initialized by the common
> initializers read_directory() or fill_directory(). There is a
> dir_clear() that _should_ be called by all callers (but I notice we
> are leaking the struct in at least one place in add-interactive.c,
> and likely others).
>
> This alternative adds some complexity to the structure, but
> provides compiler-level guarantees that these internals are not used
> outside of dir.c. I thought it worth exploring, even if we decide
> that the complexity is not worth those guarantees.
>

Another approach, if you don't mind structure pointer math is to
create two structures:

a) the external public one in the public header file

struct dir_entry {
  <public stuff>
};

b) a private structure in the source file:

struct dir_entry_private {
  struct dir_entry entry;
  <private stuff>
};

In the source file you also define a macro/function that can take a
pointer to dir_entry and get a pointer to dir_entry_private:

struct dir_entry_private *dir_entry_to_private(struct dir_entry *entry)
{
  return (struct dir_entry_private *)(<calculate the offset that entry
inside dir_entry_private is, then subtract that from entry pointer>))
}

In Linux kernel this is container_of, not sure if git has this already
defined and its a common pattern.

Then you can set the code up such that the only way to allocate a
dir_entry is to call some function in the dir code. If a new entry
needs to be allocated, implement an alloc function that has the full
private structure definition.

This way you don't need an extra field in the dir_entry struct at all,
but at the cost of requiring special allocations. It works great for
code where the only way to get a dir_entry is already some other
functions that would ensure the private version is setup correctly.

> The best news is that your existing series makes it easier to flip
> to the internal pointer method in the future, since we can shift
> the 'd->internal.member" uses into "d->internal->member" in a
> mechanical way. Thus, the change you are proposing does not lock us
> into this approach if we change our minds later.
>

I think either approach is good. I like container_of because I'm quite
used to it in low level kernel code and its a good way to provide
private/public split abstractions there. The private pointer variation
is also a common approach to this problem and I think it sounds like
we already use it in a few places. Its perhaps better for those
reasons.

> Thanks,
> -Stolee
Jonathan Tan Feb. 24, 2023, 11:36 p.m. UTC | #7
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> I wrote this patch series about a year and a half ago, but never submitted
> it. I rebased and updated it due to [0].

[snip]

> [0] Search for "Extremely yes" in
> https://lore.kernel.org/git/CAJoAoZm+TkCL0Jpg_qFgKottxbtiG2QOiY0qGrz3-uQy+=waPg@mail.gmail.com/

I left some minor comments, but otherwise, this looks good. I do share
the desire to avoid unnecessary refactoring churn, but demarcation
of private and public fields does make code much clearer and can
potentially avoid a whole class of bugs, so I would be happy if this
patch set was merged in.