Message ID | pull.1149.git.1677143700.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Clarify API for dir.[ch] and unpack-trees.[ch] -- mark relevant fields as internal | expand |
On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote: > This patch is primarily about moving internal-only fields within these two > structs into an embedded internal struct. Patch breakdown: > > * Patches 1-3: Restructuring dir_struct > * Patch 1: Splitting off internal-use-only fields > * Patch 2: Add important usage note to avoid accidentally using > deprecated API > * Patch 3: Mark output-only fields as such > * Patches 4-11: Restructuring unpack_trees_options > * Patches 4-6: Preparatory cleanup > * Patches 7-10: Splitting off internal-use-only fields > * Patch 11: Mark output-only field as such > And after the changes: > > struct dir_struct { > enum [...] flags; > int nr; /* output only */ > int ignored_nr; /* output only */ > struct dir_entry **entries; /* output only */ > struct dir_entry **ignored; /* output only */ > struct untracked_cache *untracked; > const char *exclude_per_dir; /* deprecated */ > struct dir_struct_internal { > int alloc; > int ignored_alloc; > #define EXC_CMDL 0 > #define EXC_DIRS 1 > #define EXC_FILE 2 > struct exclude_list_group exclude_list_group[3]; > struct exclude_stack *exclude_stack; > struct path_pattern *pattern; > struct strbuf basebuf; > struct oid_stat ss_info_exclude; > struct oid_stat ss_excludes_file; > unsigned unmanaged_exclude_files; > unsigned visited_paths; > unsigned visited_directories; > } internal; > }; This does present a very clear structure to avoid callers being confused when writing these changes. It doesn't, however, present any way to guarantee that callers can't mutate this state. ...here I go on a side track thinking of an alternative... One way to track this would be to anonymously declare 'struct dir_struct_internal' in the header file and let 'struct dir_struct' contain a _pointer_ to the internal struct. The dir_struct_internal can then be defined inside the .c file, limiting its scope. (It must be a pointer in dir_struct or else callers would not be able to create a dir_struct without using a pointer and an initializer method. The major downside to this pointer approach is that the internal struct needs to be initialized within API calls and somehow cleared by all callers. The internal data could be initialized by the common initializers read_directory() or fill_directory(). There is a dir_clear() that _should_ be called by all callers (but I notice we are leaking the struct in at least one place in add-interactive.c, and likely others). This alternative adds some complexity to the structure, but provides compiler-level guarantees that these internals are not used outside of dir.c. I thought it worth exploring, even if we decide that the complexity is not worth those guarantees. The best news is that your existing series makes it easier to flip to the internal pointer method in the future, since we can shift the 'd->internal.member" uses into "d->internal->member" in a mechanical way. Thus, the change you are proposing does not lock us into this approach if we change our minds later. Thanks, -Stolee
On 2/23/2023 10:18 AM, Derrick Stolee wrote: > On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote: >> This patch is primarily about moving internal-only fields within these two >> structs into an embedded internal struct. Patch breakdown: >> >> * Patches 1-3: Restructuring dir_struct >> * Patch 1: Splitting off internal-use-only fields >> * Patch 2: Add important usage note to avoid accidentally using >> deprecated API >> * Patch 3: Mark output-only fields as such >> * Patches 4-11: Restructuring unpack_trees_options >> * Patches 4-6: Preparatory cleanup >> * Patches 7-10: Splitting off internal-use-only fields >> * Patch 11: Mark output-only field as such ... > The best news is that your existing series makes it easier to flip > to the internal pointer method in the future, since we can shift > the 'd->internal.member" uses into "d->internal->member" in a > mechanical way. Thus, the change you are proposing does not lock us > into this approach if we change our minds later. And now that I've read the series in its entirety, I think it is well organized and does not need any updates. It creates a better situation than what we already have, and any changes to split the internal structs to be anonymous to callers can be done as a follow-up. Thanks, -Stolee
On Thu, Feb 23, 2023 at 7:18 AM Derrick Stolee <derrickstolee@github.com> wrote: > > On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote: > > This patch is primarily about moving internal-only fields within these two > > structs into an embedded internal struct. Patch breakdown: > > > > * Patches 1-3: Restructuring dir_struct > > * Patch 1: Splitting off internal-use-only fields > > * Patch 2: Add important usage note to avoid accidentally using > > deprecated API > > * Patch 3: Mark output-only fields as such > > * Patches 4-11: Restructuring unpack_trees_options > > * Patches 4-6: Preparatory cleanup > > * Patches 7-10: Splitting off internal-use-only fields > > * Patch 11: Mark output-only field as such > > > And after the changes: > > > > struct dir_struct { > > enum [...] flags; > > int nr; /* output only */ > > int ignored_nr; /* output only */ > > struct dir_entry **entries; /* output only */ > > struct dir_entry **ignored; /* output only */ > > struct untracked_cache *untracked; > > const char *exclude_per_dir; /* deprecated */ > > struct dir_struct_internal { > > int alloc; > > int ignored_alloc; > > #define EXC_CMDL 0 > > #define EXC_DIRS 1 > > #define EXC_FILE 2 > > struct exclude_list_group exclude_list_group[3]; > > struct exclude_stack *exclude_stack; > > struct path_pattern *pattern; > > struct strbuf basebuf; > > struct oid_stat ss_info_exclude; > > struct oid_stat ss_excludes_file; > > unsigned unmanaged_exclude_files; > > unsigned visited_paths; > > unsigned visited_directories; > > } internal; > > }; > > This does present a very clear structure to avoid callers being > confused when writing these changes. It doesn't, however, present > any way to guarantee that callers can't mutate this state. > > ...here I go on a side track thinking of an alternative... > > One way to track this would be to anonymously declare 'struct > dir_struct_internal' in the header file and let 'struct dir_struct' > contain a _pointer_ to the internal struct. The dir_struct_internal > can then be defined inside the .c file, limiting its scope. (It > must be a pointer in dir_struct or else callers would not be able > to create a dir_struct without using a pointer and an initializer > method. > > The major downside to this pointer approach is that the internal > struct needs to be initialized within API calls and somehow cleared > by all callers. The internal data could be initialized by the common > initializers read_directory() or fill_directory(). There is a > dir_clear() that _should_ be called by all callers (but I notice we > are leaking the struct in at least one place in add-interactive.c, > and likely others). > > This alternative adds some complexity to the structure, but > provides compiler-level guarantees that these internals are not used > outside of dir.c. I thought it worth exploring, even if we decide > that the complexity is not worth those guarantees. In addition to the guarantees that others won't muck with those fields, such an approach would also buy us the following: * The implementation can change and internal data structures be modified/extended/appended-to, without requiring recompiling all files that depend upon our header. * Related to the last point, the ABI doesn't change when the implementation and internal data structures change. Might be helpful for libification efforts. For all three of these reasons, I pursued this alternate strategy you bring up in merge-recursive and merge-ort (they both use a "struct merge_options_internal *priv" and then define _very_ different "struct merge_options_iternal" in their respective C files). I thought about using this strategy you suggest here, but was worried at the time I created this patch that it might create more friction for Ævar who was pushing his struct initialization work and memory leak cleanups in the same area heavily at the time (and back then we had a few long threads back and forth because our work was overlapping and clashing slightly). But I figured that doing at least this much was good, because of what you point out: > The best news is that your existing series makes it easier to flip > to the internal pointer method in the future, since we can shift > the 'd->internal.member" uses into "d->internal->member" in a > mechanical way. Thus, the change you are proposing does not lock us > into this approach if we change our minds later.
On Thu, Feb 23, 2023 at 7:26 AM Derrick Stolee <derrickstolee@github.com> wrote: > > On 2/23/2023 10:18 AM, Derrick Stolee wrote: > > On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote: > >> This patch is primarily about moving internal-only fields within these two > >> structs into an embedded internal struct. Patch breakdown: > >> > >> * Patches 1-3: Restructuring dir_struct > >> * Patch 1: Splitting off internal-use-only fields > >> * Patch 2: Add important usage note to avoid accidentally using > >> deprecated API > >> * Patch 3: Mark output-only fields as such > >> * Patches 4-11: Restructuring unpack_trees_options > >> * Patches 4-6: Preparatory cleanup > >> * Patches 7-10: Splitting off internal-use-only fields > >> * Patch 11: Mark output-only field as such > ... > > The best news is that your existing series makes it easier to flip > > to the internal pointer method in the future, since we can shift > > the 'd->internal.member" uses into "d->internal->member" in a > > mechanical way. Thus, the change you are proposing does not lock us > > into this approach if we change our minds later. > > And now that I've read the series in its entirety, I think it is > well organized and does not need any updates. It creates a better > situation than what we already have, and any changes to split the > internal structs to be anonymous to callers can be done as a > follow-up. Wow, I was a bit worried pushing a couple dozen patches last night that it'd be weeks before anyone took a look, and perhaps even that I'd again get comments that I was pushing too many to the list. You read and reviewed all of them across both series, including some comments showing you read pretty carefully, all before I had even woken up. Very cool; thanks.
Derrick Stolee <derrickstolee@github.com> writes: > The major downside to this pointer approach is that the internal > struct needs to be initialized within API calls and somehow cleared > by all callers. The internal data could be initialized by the common > initializers read_directory() or fill_directory(). There is a > dir_clear() that _should_ be called by all callers (but I notice we > are leaking the struct in at least one place in add-interactive.c, > and likely others). > > This alternative adds some complexity to the structure, but > provides compiler-level guarantees that these internals are not used > outside of dir.c. I thought it worth exploring, even if we decide > that the complexity is not worth those guarantees. I actually think the current structure may be a good place to stop at. Or we could use the original flat structure, but with members that are supposed to be private prefixed with a longer prefix that is very specific to the dir.c file, say "private_to_dir_c_". Then have a block of #define #define alloc private_to_dir_c_alloc #define ignored_alloc private_to_dir_c_ignored_alloc ... #define visited_directories private_to_dir_c_visited_directories at the beginning dir.c to hide the cruft out of the implementation. "git grep private_to_dir_c ':!dir.c'" would catch any outsider peeking into the part of the struct that they shouldn't be touching.
On Thu, Feb 23, 2023 at 7:29 AM Derrick Stolee <derrickstolee@github.com> wrote: > > On 2/23/2023 4:14 AM, Elijah Newren via GitGitGadget wrote: > > This patch is primarily about moving internal-only fields within these two > > structs into an embedded internal struct. Patch breakdown: > > > > * Patches 1-3: Restructuring dir_struct > > * Patch 1: Splitting off internal-use-only fields > > * Patch 2: Add important usage note to avoid accidentally using > > deprecated API > > * Patch 3: Mark output-only fields as such > > * Patches 4-11: Restructuring unpack_trees_options > > * Patches 4-6: Preparatory cleanup > > * Patches 7-10: Splitting off internal-use-only fields > > * Patch 11: Mark output-only field as such > > > And after the changes: > > > > struct dir_struct { > > enum [...] flags; > > int nr; /* output only */ > > int ignored_nr; /* output only */ > > struct dir_entry **entries; /* output only */ > > struct dir_entry **ignored; /* output only */ > > struct untracked_cache *untracked; > > const char *exclude_per_dir; /* deprecated */ > > struct dir_struct_internal { > > int alloc; > > int ignored_alloc; > > #define EXC_CMDL 0 > > #define EXC_DIRS 1 > > #define EXC_FILE 2 > > struct exclude_list_group exclude_list_group[3]; > > struct exclude_stack *exclude_stack; > > struct path_pattern *pattern; > > struct strbuf basebuf; > > struct oid_stat ss_info_exclude; > > struct oid_stat ss_excludes_file; > > unsigned unmanaged_exclude_files; > > unsigned visited_paths; > > unsigned visited_directories; > > } internal; > > }; > > This does present a very clear structure to avoid callers being > confused when writing these changes. It doesn't, however, present > any way to guarantee that callers can't mutate this state. > > ...here I go on a side track thinking of an alternative... > > One way to track this would be to anonymously declare 'struct > dir_struct_internal' in the header file and let 'struct dir_struct' > contain a _pointer_ to the internal struct. The dir_struct_internal > can then be defined inside the .c file, limiting its scope. (It > must be a pointer in dir_struct or else callers would not be able > to create a dir_struct without using a pointer and an initializer > method. > > The major downside to this pointer approach is that the internal > struct needs to be initialized within API calls and somehow cleared > by all callers. The internal data could be initialized by the common > initializers read_directory() or fill_directory(). There is a > dir_clear() that _should_ be called by all callers (but I notice we > are leaking the struct in at least one place in add-interactive.c, > and likely others). > > This alternative adds some complexity to the structure, but > provides compiler-level guarantees that these internals are not used > outside of dir.c. I thought it worth exploring, even if we decide > that the complexity is not worth those guarantees. > Another approach, if you don't mind structure pointer math is to create two structures: a) the external public one in the public header file struct dir_entry { <public stuff> }; b) a private structure in the source file: struct dir_entry_private { struct dir_entry entry; <private stuff> }; In the source file you also define a macro/function that can take a pointer to dir_entry and get a pointer to dir_entry_private: struct dir_entry_private *dir_entry_to_private(struct dir_entry *entry) { return (struct dir_entry_private *)(<calculate the offset that entry inside dir_entry_private is, then subtract that from entry pointer>)) } In Linux kernel this is container_of, not sure if git has this already defined and its a common pattern. Then you can set the code up such that the only way to allocate a dir_entry is to call some function in the dir code. If a new entry needs to be allocated, implement an alloc function that has the full private structure definition. This way you don't need an extra field in the dir_entry struct at all, but at the cost of requiring special allocations. It works great for code where the only way to get a dir_entry is already some other functions that would ensure the private version is setup correctly. > The best news is that your existing series makes it easier to flip > to the internal pointer method in the future, since we can shift > the 'd->internal.member" uses into "d->internal->member" in a > mechanical way. Thus, the change you are proposing does not lock us > into this approach if we change our minds later. > I think either approach is good. I like container_of because I'm quite used to it in low level kernel code and its a good way to provide private/public split abstractions there. The private pointer variation is also a common approach to this problem and I think it sounds like we already use it in a few places. Its perhaps better for those reasons. > Thanks, > -Stolee
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > I wrote this patch series about a year and a half ago, but never submitted > it. I rebased and updated it due to [0]. [snip] > [0] Search for "Extremely yes" in > https://lore.kernel.org/git/CAJoAoZm+TkCL0Jpg_qFgKottxbtiG2QOiY0qGrz3-uQy+=waPg@mail.gmail.com/ I left some minor comments, but otherwise, this looks good. I do share the desire to avoid unnecessary refactoring churn, but demarcation of private and public fields does make code much clearer and can potentially avoid a whole class of bugs, so I would be happy if this patch set was merged in.