diff mbox series

do we have too much fsync() configuration in 'next'? (was: [PATCH v7] core.fsync: documentation and user-friendly aggregate options)

Message ID 220323.86fsn8ohg8.gmgdl@evledraar.gmail.com (mailing list archive)
State New, archived
Headers show
Series do we have too much fsync() configuration in 'next'? (was: [PATCH v7] core.fsync: documentation and user-friendly aggregate options) | expand

Commit Message

Ævar Arnfjörð Bjarmason March 23, 2022, 2:20 p.m. UTC
On Tue, Mar 15 2022, Neeraj Singh wrote:

I know this is probably 80% my fault by egging you on about initially
adding the wildmatch() based thing you didn't go for.

But having looked at this with fresh eyes quite deeply I really think
we're severely over-configuring things here:

> +core.fsync::
> +	A comma-separated list of components of the repository that
> +	should be hardened via the core.fsyncMethod when created or
> +	modified.  You can disable hardening of any component by
> +	prefixing it with a '-'.  Items that are not hardened may be
> +	lost in the event of an unclean	system shutdown. Unless you
> +	have special requirements, it is recommended that you leave
> +	this option empty or pick one of `committed`, `added`,
> +	or `all`.
> ++
> +When this configuration is encountered, the set of components starts with
> +the platform default value, disabled components are removed, and additional
> +components are added. `none` resets the state so that the platform default
> +is ignored.
> ++
> +The empty string resets the fsync configuration to the platform
> +default. The default on most platforms is equivalent to
> +`core.fsync=committed,-loose-object`, which has good performance,
> +but risks losing recent work in the event of an unclean system shutdown.
> ++
> +* `none` clears the set of fsynced components.
> +* `loose-object` hardens objects added to the repo in loose-object form.
> +* `pack` hardens objects added to the repo in packfile form.
> +* `pack-metadata` hardens packfile bitmaps and indexes.
> +* `commit-graph` hardens the commit graph file.
> +* `index` hardens the index when it is modified.
> +* `objects` is an aggregate option that is equivalent to
> +  `loose-object,pack`.
> +* `derived-metadata` is an aggregate option that is equivalent to
> +  `pack-metadata,commit-graph`.
> +* `committed` is an aggregate option that is currently equivalent to
> +  `objects`. This mode sacrifices some performance to ensure that work
> +  that is committed to the repository with `git commit` or similar commands
> +  is hardened.
> +* `added` is an aggregate option that is currently equivalent to
> +  `committed,index`. This mode sacrifices additional performance to
> +  ensure that the results of commands like `git add` and similar operations
> +  are hardened.
> +* `all` is an aggregate option that syncs all individual components above.
> +
>  core.fsyncMethod::
>  	A value indicating the strategy Git will use to harden repository data
>  	using fsync and related primitives.

On top of my
https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@gmail.com/
which makes the tmp-objdir part of your not-in-next-just-seen follow-up
series configurable via "fsyncMethod.batch.quarantine" I really think we
should just go for something like the belwo patch (note that
misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed
below.

I.e. I think we should just do our default fsync() of everything, and
probably SOON make the fsync-ing of loose objects the default. Those who
care about performance will have "batch" (or "writeout-only"), which we
can have OS-specific detections for.

But really, all of the rest of this is unduly boxing us into
overconfiguration that I think nobody really needs.

If someone really needs this level of detail they can LD_PRELOAD
something to have fsync intercept fd's and paths, and act appropriately.

Worse, as the RFC series I sent
(https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@gmail.com/)
shows we can and should "batch" up fsync() operations across these
configuration boundaries, which this level of configuration would seem
to preclude.

Or, we'd need to explain why "core.fsync=loose-object" won't *actually*
call fsync() on a single loose object's fd under "batch" as I had to do
on top of this in
https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@gmail.com/

The same is going to apply for almost all of the rest of these
configuration categories.

I.e. a natural follow-up to e.g. batching across objects & index as I'm
doing in
https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/
is to do likewise for all the PACK-related stuff before we rename it
in-place. Or even have "git gc" issue only a single fsync() for all of
PACKs, their metadata files, commit-graph etc., and then rename() things
in-place as appropriate afterwards.

Comments

Neeraj Singh March 25, 2022, 9:24 p.m. UTC | #1
On Wed, Mar 23, 2022 at 7:46 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Tue, Mar 15 2022, Neeraj Singh wrote:
>
> I know this is probably 80% my fault by egging you on about initially
> adding the wildmatch() based thing you didn't go for.
>
> But having looked at this with fresh eyes quite deeply I really think
> we're severely over-configuring things here:
>
> > +core.fsync::
> > +     A comma-separated list of components of the repository that
> > +     should be hardened via the core.fsyncMethod when created or
> > +     modified.  You can disable hardening of any component by
> > +     prefixing it with a '-'.  Items that are not hardened may be
> > +     lost in the event of an unclean system shutdown. Unless you
> > +     have special requirements, it is recommended that you leave
> > +     this option empty or pick one of `committed`, `added`,
> > +     or `all`.
> > ++
> > +When this configuration is encountered, the set of components starts with
> > +the platform default value, disabled components are removed, and additional
> > +components are added. `none` resets the state so that the platform default
> > +is ignored.
> > ++
> > +The empty string resets the fsync configuration to the platform
> > +default. The default on most platforms is equivalent to
> > +`core.fsync=committed,-loose-object`, which has good performance,
> > +but risks losing recent work in the event of an unclean system shutdown.
> > ++
> > +* `none` clears the set of fsynced components.
> > +* `loose-object` hardens objects added to the repo in loose-object form.
> > +* `pack` hardens objects added to the repo in packfile form.
> > +* `pack-metadata` hardens packfile bitmaps and indexes.
> > +* `commit-graph` hardens the commit graph file.
> > +* `index` hardens the index when it is modified.
> > +* `objects` is an aggregate option that is equivalent to
> > +  `loose-object,pack`.
> > +* `derived-metadata` is an aggregate option that is equivalent to
> > +  `pack-metadata,commit-graph`.
> > +* `committed` is an aggregate option that is currently equivalent to
> > +  `objects`. This mode sacrifices some performance to ensure that work
> > +  that is committed to the repository with `git commit` or similar commands
> > +  is hardened.
> > +* `added` is an aggregate option that is currently equivalent to
> > +  `committed,index`. This mode sacrifices additional performance to
> > +  ensure that the results of commands like `git add` and similar operations
> > +  are hardened.
> > +* `all` is an aggregate option that syncs all individual components above.
> > +
> >  core.fsyncMethod::
> >       A value indicating the strategy Git will use to harden repository data
> >       using fsync and related primitives.
>
> On top of my
> https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@gmail.com/
> which makes the tmp-objdir part of your not-in-next-just-seen follow-up
> series configurable via "fsyncMethod.batch.quarantine" I really think we
> should just go for something like the belwo patch (note that
> misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed
> below.
>
> I.e. I think we should just do our default fsync() of everything, and
> probably SOON make the fsync-ing of loose objects the default. Those who
> care about performance will have "batch" (or "writeout-only"), which we
> can have OS-specific detections for.
>
> But really, all of the rest of this is unduly boxing us into
> overconfiguration that I think nobody really needs.
>

We've gone over this a few times already, but just wanted to state it
again.  The really detailed settings are really there for Git hosters
like GitLab or GitHub. I'd be happy to remove the per-component
core.fsync values from the documentation and leave just the ones we
point the user to.

> If someone really needs this level of detail they can LD_PRELOAD
> something to have fsync intercept fd's and paths, and act appropriately.
>
> Worse, as the RFC series I sent
> (https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@gmail.com/)
> shows we can and should "batch" up fsync() operations across these
> configuration boundaries, which this level of configuration would seem
> to preclude.
>
> Or, we'd need to explain why "core.fsync=loose-object" won't *actually*
> call fsync() on a single loose object's fd under "batch" as I had to do
> on top of this in
> https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@gmail.com/
>

99.9% of users don't care and won't look.  The ones who do look deeper
and understand the issues have source code and access to this ML
discussion to understand why this works this way.

> The same is going to apply for almost all of the rest of these
> configuration categories.
>
> I.e. a natural follow-up to e.g. batching across objects & index as I'm
> doing in
> https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/
> is to do likewise for all the PACK-related stuff before we rename it
> in-place. Or even have "git gc" issue only a single fsync() for all of
> PACKs, their metadata files, commit-graph etc., and then rename() things
> in-place as appropriate afterwards.
>
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index 365a12dc7ae..536238e209b 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -548,49 +548,35 @@ core.whitespace::
>    errors. The default tab width is 8. Allowed values are 1 to 63.
>
>  core.fsync::
> -       A comma-separated list of components of the repository that
> -       should be hardened via the core.fsyncMethod when created or
> -       modified.  You can disable hardening of any component by
> -       prefixing it with a '-'.  Items that are not hardened may be
> -       lost in the event of an unclean system shutdown. Unless you
> -       have special requirements, it is recommended that you leave
> -       this option empty or pick one of `committed`, `added`,
> -       or `all`.
> -+
> -When this configuration is encountered, the set of components starts with
> -the platform default value, disabled components are removed, and additional
> -components are added. `none` resets the state so that the platform default
> -is ignored.
> -+
> -The empty string resets the fsync configuration to the platform
> -default. The default on most platforms is equivalent to
> -`core.fsync=committed,-loose-object`, which has good performance,
> -but risks losing recent work in the event of an unclean system shutdown.
> -+
> -* `none` clears the set of fsynced components.
> -* `loose-object` hardens objects added to the repo in loose-object form.
> -* `pack` hardens objects added to the repo in packfile form.
> -* `pack-metadata` hardens packfile bitmaps and indexes.
> -* `commit-graph` hardens the commit graph file.
> -* `index` hardens the index when it is modified.
> -* `objects` is an aggregate option that is equivalent to
> -  `loose-object,pack`.
> -* `derived-metadata` is an aggregate option that is equivalent to
> -  `pack-metadata,commit-graph`.
> -* `committed` is an aggregate option that is currently equivalent to
> -  `objects`. This mode sacrifices some performance to ensure that work
> -  that is committed to the repository with `git commit` or similar commands
> -  is hardened.
> -* `added` is an aggregate option that is currently equivalent to
> -  `committed,index`. This mode sacrifices additional performance to
> -  ensure that the results of commands like `git add` and similar operations
> -  are hardened.
> -* `all` is an aggregate option that syncs all individual components above.
> +       A boolen defaulting to `true`. To ensure data integrity git
> +       will fsync() its objects, index and refu updates etc. This can
> +       be set to `false` to disable `fsync()`-ing.
> ++
> +Only set this to `false` if you know what you're doing, and are
> +prepared to deal with data corruption. Valid use-cases include
> +throwaway uses of repositories on ramdisks, one-off mass-imports
> +followed by calling `sync(1)` etc.
> ++
> +Note that the syncing of loose objects is currently excluded from
> +`core.fsync=true`. To turn on all fsync-ing you'll need
> +`core.fsync=true` and `core.fsyncObjectFiles=true`, but see
> +`core.fsyncMethod=batch` below for a much faster alternative that's
> +just as safe on various modern OS's.
> ++
> +The default is in flux and may change in the future, in particular the
> +equivalent of the already-deprecated `core.fsyncObjectFiles` setting
> +might soon default to `true`, and `core.fsyncMethod`'s default of
> +`fsync` might default to a setting deemed to be safe on the local OS,
> +suc has `batch` or `writeout-only`
>
>  core.fsyncMethod::
>         A value indicating the strategy Git will use to harden repository data
>         using fsync and related primitives.
>  +
> +Defaults to `fsync`, but as discussed for `core.fsync` above might
> +change to use one of the values below taking advantage of
> +platform-specific "faster `fsync()`".
> ++
>  * `fsync` uses the fsync() system call or platform equivalents.
>  * `writeout-only` issues pagecache writeback requests, but depending on the
>    filesystem and storage hardware, data added to the repository may not be
> @@ -680,8 +666,8 @@ backed up by any standard (e.g. POSIX), but worked in practice on some
>  Linux setups.
>  +
>  Nowadays you should almost certainly want to use
> -`core.fsync=loose-object` instead in combination with
> -`core.fsyncMethod=bulk`, and possibly with
> +`core.fsync=true` instead in combination with
> +`core.fsyncMethod=batch`, and possibly with
>  `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
>  OSX, Windows) that gives you most of the performance benefit of
>  `core.fsyncObjectFiles=false` with all of the safety of the old

I'm at the point where I don't want to endlessly revisit this discussion.

-Neeraj
Ævar Arnfjörð Bjarmason March 26, 2022, 12:24 a.m. UTC | #2
On Fri, Mar 25 2022, Neeraj Singh wrote:

> On Wed, Mar 23, 2022 at 7:46 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Tue, Mar 15 2022, Neeraj Singh wrote:
>>
>> I know this is probably 80% my fault by egging you on about initially
>> adding the wildmatch() based thing you didn't go for.
>>
>> But having looked at this with fresh eyes quite deeply I really think
>> we're severely over-configuring things here:
>>
>> > +core.fsync::
>> > +     A comma-separated list of components of the repository that
>> > +     should be hardened via the core.fsyncMethod when created or
>> > +     modified.  You can disable hardening of any component by
>> > +     prefixing it with a '-'.  Items that are not hardened may be
>> > +     lost in the event of an unclean system shutdown. Unless you
>> > +     have special requirements, it is recommended that you leave
>> > +     this option empty or pick one of `committed`, `added`,
>> > +     or `all`.
>> > ++
>> > +When this configuration is encountered, the set of components starts with
>> > +the platform default value, disabled components are removed, and additional
>> > +components are added. `none` resets the state so that the platform default
>> > +is ignored.
>> > ++
>> > +The empty string resets the fsync configuration to the platform
>> > +default. The default on most platforms is equivalent to
>> > +`core.fsync=committed,-loose-object`, which has good performance,
>> > +but risks losing recent work in the event of an unclean system shutdown.
>> > ++
>> > +* `none` clears the set of fsynced components.
>> > +* `loose-object` hardens objects added to the repo in loose-object form.
>> > +* `pack` hardens objects added to the repo in packfile form.
>> > +* `pack-metadata` hardens packfile bitmaps and indexes.
>> > +* `commit-graph` hardens the commit graph file.
>> > +* `index` hardens the index when it is modified.
>> > +* `objects` is an aggregate option that is equivalent to
>> > +  `loose-object,pack`.
>> > +* `derived-metadata` is an aggregate option that is equivalent to
>> > +  `pack-metadata,commit-graph`.
>> > +* `committed` is an aggregate option that is currently equivalent to
>> > +  `objects`. This mode sacrifices some performance to ensure that work
>> > +  that is committed to the repository with `git commit` or similar commands
>> > +  is hardened.
>> > +* `added` is an aggregate option that is currently equivalent to
>> > +  `committed,index`. This mode sacrifices additional performance to
>> > +  ensure that the results of commands like `git add` and similar operations
>> > +  are hardened.
>> > +* `all` is an aggregate option that syncs all individual components above.
>> > +
>> >  core.fsyncMethod::
>> >       A value indicating the strategy Git will use to harden repository data
>> >       using fsync and related primitives.
>>
>> On top of my
>> https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@gmail.com/
>> which makes the tmp-objdir part of your not-in-next-just-seen follow-up
>> series configurable via "fsyncMethod.batch.quarantine" I really think we
>> should just go for something like the belwo patch (note that
>> misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed
>> below.
>>
>> I.e. I think we should just do our default fsync() of everything, and
>> probably SOON make the fsync-ing of loose objects the default. Those who
>> care about performance will have "batch" (or "writeout-only"), which we
>> can have OS-specific detections for.
>>
>> But really, all of the rest of this is unduly boxing us into
>> overconfiguration that I think nobody really needs.
>>
>
> We've gone over this a few times already, but just wanted to state it
> again.  The really detailed settings are really there for Git hosters
> like GitLab or GitHub. I'd be happy to remove the per-component
> core.fsync values from the documentation and leave just the ones we
> point the user to.

I'm prettty sure (but Patrick knows more) that GitLab's plan for this is
to keep it at whatever the safest setting is, presumably GitHub's as
well (but I don't know at all on that front).

>> If someone really needs this level of detail they can LD_PRELOAD
>> something to have fsync intercept fd's and paths, and act appropriately.
>>
>> Worse, as the RFC series I sent
>> (https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@gmail.com/)
>> shows we can and should "batch" up fsync() operations across these
>> configuration boundaries, which this level of configuration would seem
>> to preclude.
>>
>> Or, we'd need to explain why "core.fsync=loose-object" won't *actually*
>> call fsync() on a single loose object's fd under "batch" as I had to do
>> on top of this in
>> https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@gmail.com/
>>
>
> 99.9% of users don't care and won't look.  The ones who do look deeper
> and understand the issues have source code and access to this ML
> discussion to understand why this works this way.

Exactly, so we can hopefully have a simpler interface.

>> The same is going to apply for almost all of the rest of these
>> configuration categories.
>>
>> I.e. a natural follow-up to e.g. batching across objects & index as I'm
>> doing in
>> https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/
>> is to do likewise for all the PACK-related stuff before we rename it
>> in-place. Or even have "git gc" issue only a single fsync() for all of
>> PACKs, their metadata files, commit-graph etc., and then rename() things
>> in-place as appropriate afterwards.
>>
>> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
>> index 365a12dc7ae..536238e209b 100644
>> --- a/Documentation/config/core.txt
>> +++ b/Documentation/config/core.txt
>> @@ -548,49 +548,35 @@ core.whitespace::
>>    errors. The default tab width is 8. Allowed values are 1 to 63.
>>
>>  core.fsync::
>> -       A comma-separated list of components of the repository that
>> -       should be hardened via the core.fsyncMethod when created or
>> -       modified.  You can disable hardening of any component by
>> -       prefixing it with a '-'.  Items that are not hardened may be
>> -       lost in the event of an unclean system shutdown. Unless you
>> -       have special requirements, it is recommended that you leave
>> -       this option empty or pick one of `committed`, `added`,
>> -       or `all`.
>> -+
>> -When this configuration is encountered, the set of components starts with
>> -the platform default value, disabled components are removed, and additional
>> -components are added. `none` resets the state so that the platform default
>> -is ignored.
>> -+
>> -The empty string resets the fsync configuration to the platform
>> -default. The default on most platforms is equivalent to
>> -`core.fsync=committed,-loose-object`, which has good performance,
>> -but risks losing recent work in the event of an unclean system shutdown.
>> -+
>> -* `none` clears the set of fsynced components.
>> -* `loose-object` hardens objects added to the repo in loose-object form.
>> -* `pack` hardens objects added to the repo in packfile form.
>> -* `pack-metadata` hardens packfile bitmaps and indexes.
>> -* `commit-graph` hardens the commit graph file.
>> -* `index` hardens the index when it is modified.
>> -* `objects` is an aggregate option that is equivalent to
>> -  `loose-object,pack`.
>> -* `derived-metadata` is an aggregate option that is equivalent to
>> -  `pack-metadata,commit-graph`.
>> -* `committed` is an aggregate option that is currently equivalent to
>> -  `objects`. This mode sacrifices some performance to ensure that work
>> -  that is committed to the repository with `git commit` or similar commands
>> -  is hardened.
>> -* `added` is an aggregate option that is currently equivalent to
>> -  `committed,index`. This mode sacrifices additional performance to
>> -  ensure that the results of commands like `git add` and similar operations
>> -  are hardened.
>> -* `all` is an aggregate option that syncs all individual components above.
>> +       A boolen defaulting to `true`. To ensure data integrity git
>> +       will fsync() its objects, index and refu updates etc. This can
>> +       be set to `false` to disable `fsync()`-ing.
>> ++
>> +Only set this to `false` if you know what you're doing, and are
>> +prepared to deal with data corruption. Valid use-cases include
>> +throwaway uses of repositories on ramdisks, one-off mass-imports
>> +followed by calling `sync(1)` etc.
>> ++
>> +Note that the syncing of loose objects is currently excluded from
>> +`core.fsync=true`. To turn on all fsync-ing you'll need
>> +`core.fsync=true` and `core.fsyncObjectFiles=true`, but see
>> +`core.fsyncMethod=batch` below for a much faster alternative that's
>> +just as safe on various modern OS's.
>> ++
>> +The default is in flux and may change in the future, in particular the
>> +equivalent of the already-deprecated `core.fsyncObjectFiles` setting
>> +might soon default to `true`, and `core.fsyncMethod`'s default of
>> +`fsync` might default to a setting deemed to be safe on the local OS,
>> +suc has `batch` or `writeout-only`
>>
>>  core.fsyncMethod::
>>         A value indicating the strategy Git will use to harden repository data
>>         using fsync and related primitives.
>>  +
>> +Defaults to `fsync`, but as discussed for `core.fsync` above might
>> +change to use one of the values below taking advantage of
>> +platform-specific "faster `fsync()`".
>> ++
>>  * `fsync` uses the fsync() system call or platform equivalents.
>>  * `writeout-only` issues pagecache writeback requests, but depending on the
>>    filesystem and storage hardware, data added to the repository may not be
>> @@ -680,8 +666,8 @@ backed up by any standard (e.g. POSIX), but worked in practice on some
>>  Linux setups.
>>  +
>>  Nowadays you should almost certainly want to use
>> -`core.fsync=loose-object` instead in combination with
>> -`core.fsyncMethod=bulk`, and possibly with
>> +`core.fsync=true` instead in combination with
>> +`core.fsyncMethod=batch`, and possibly with
>>  `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
>>  OSX, Windows) that gives you most of the performance benefit of
>>  `core.fsyncObjectFiles=false` with all of the safety of the old
>
> I'm at the point where I don't want to endlessly revisit this discussion.

Sorry, my intention isn't to frustrate you, but I do think it's
important to get this right.

Particularly since this is now in "next", and we're getting closer to a
release. We can either talk about this now and decide on something, or
it'll be in a release, and then publicly documented promises will be
harder to back out of.

I think your suggestion of just hiding the relevant documentation would
be a good band-aid solution to that.

But I also think that given how I was altering this in my RFC series
that the premise of how this could be structured has been called into
question in a way that we didn't (or I don't recall) us having discussed
before.

I.e. that we can say "sync loose, but not index", or "sync index, but
not loose" with this config schema. When with "bulk" we it really isn't
any more expensive to do both if one is true (even cheaper, actually).
Junio C Hamano March 26, 2022, 1:23 a.m. UTC | #3
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> We've gone over this a few times already, but just wanted to state it
>> again.  The really detailed settings are really there for Git hosters
>> like GitLab or GitHub. I'd be happy to remove the per-component
>> core.fsync values from the documentation and leave just the ones we
>> point the user to.
>
> I'm prettty sure (but Patrick knows more) that GitLab's plan for this is
> to keep it at whatever the safest setting is, presumably GitHub's as
> well (but I don't know at all on that front).

I thought we've already settled it long ago, and it wasn't like you
were taking vacation from the list for a few weeks.  Why are we
talking about this again now?  Did we discover any new argument
against it?

Puzzled.
Neeraj Singh March 26, 2022, 1:25 a.m. UTC | #4
On Fri, Mar 25, 2022 at 5:33 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Fri, Mar 25 2022, Neeraj Singh wrote:
>
> > On Wed, Mar 23, 2022 at 7:46 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>
> >> On Tue, Mar 15 2022, Neeraj Singh wrote:
> >>
> >> I know this is probably 80% my fault by egging you on about initially
> >> adding the wildmatch() based thing you didn't go for.
> >>
> >> But having looked at this with fresh eyes quite deeply I really think
> >> we're severely over-configuring things here:
> >>
> >> > +core.fsync::
> >> > +     A comma-separated list of components of the repository that
> >> > +     should be hardened via the core.fsyncMethod when created or
> >> > +     modified.  You can disable hardening of any component by
> >> > +     prefixing it with a '-'.  Items that are not hardened may be
> >> > +     lost in the event of an unclean system shutdown. Unless you
> >> > +     have special requirements, it is recommended that you leave
> >> > +     this option empty or pick one of `committed`, `added`,
> >> > +     or `all`.
> >> > ++
> >> > +When this configuration is encountered, the set of components starts with
> >> > +the platform default value, disabled components are removed, and additional
> >> > +components are added. `none` resets the state so that the platform default
> >> > +is ignored.
> >> > ++
> >> > +The empty string resets the fsync configuration to the platform
> >> > +default. The default on most platforms is equivalent to
> >> > +`core.fsync=committed,-loose-object`, which has good performance,
> >> > +but risks losing recent work in the event of an unclean system shutdown.
> >> > ++
> >> > +* `none` clears the set of fsynced components.
> >> > +* `loose-object` hardens objects added to the repo in loose-object form.
> >> > +* `pack` hardens objects added to the repo in packfile form.
> >> > +* `pack-metadata` hardens packfile bitmaps and indexes.
> >> > +* `commit-graph` hardens the commit graph file.
> >> > +* `index` hardens the index when it is modified.
> >> > +* `objects` is an aggregate option that is equivalent to
> >> > +  `loose-object,pack`.
> >> > +* `derived-metadata` is an aggregate option that is equivalent to
> >> > +  `pack-metadata,commit-graph`.
> >> > +* `committed` is an aggregate option that is currently equivalent to
> >> > +  `objects`. This mode sacrifices some performance to ensure that work
> >> > +  that is committed to the repository with `git commit` or similar commands
> >> > +  is hardened.
> >> > +* `added` is an aggregate option that is currently equivalent to
> >> > +  `committed,index`. This mode sacrifices additional performance to
> >> > +  ensure that the results of commands like `git add` and similar operations
> >> > +  are hardened.
> >> > +* `all` is an aggregate option that syncs all individual components above.
> >> > +
> >> >  core.fsyncMethod::
> >> >       A value indicating the strategy Git will use to harden repository data
> >> >       using fsync and related primitives.
> >>
> >> On top of my
> >> https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@gmail.com/
> >> which makes the tmp-objdir part of your not-in-next-just-seen follow-up
> >> series configurable via "fsyncMethod.batch.quarantine" I really think we
> >> should just go for something like the belwo patch (note that
> >> misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed
> >> below.
> >>
> >> I.e. I think we should just do our default fsync() of everything, and
> >> probably SOON make the fsync-ing of loose objects the default. Those who
> >> care about performance will have "batch" (or "writeout-only"), which we
> >> can have OS-specific detections for.
> >>
> >> But really, all of the rest of this is unduly boxing us into
> >> overconfiguration that I think nobody really needs.
> >>
> >
> > We've gone over this a few times already, but just wanted to state it
> > again.  The really detailed settings are really there for Git hosters
> > like GitLab or GitHub. I'd be happy to remove the per-component
> > core.fsync values from the documentation and leave just the ones we
> > point the user to.
>
> I'm prettty sure (but Patrick knows more) that GitLab's plan for this is
> to keep it at whatever the safest setting is, presumably GitHub's as
> well (but I don't know at all on that front).
>
> >> If someone really needs this level of detail they can LD_PRELOAD
> >> something to have fsync intercept fd's and paths, and act appropriately.
> >>
> >> Worse, as the RFC series I sent
> >> (https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@gmail.com/)
> >> shows we can and should "batch" up fsync() operations across these
> >> configuration boundaries, which this level of configuration would seem
> >> to preclude.
> >>
> >> Or, we'd need to explain why "core.fsync=loose-object" won't *actually*
> >> call fsync() on a single loose object's fd under "batch" as I had to do
> >> on top of this in
> >> https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@gmail.com/
> >>
> >
> > 99.9% of users don't care and won't look.  The ones who do look deeper
> > and understand the issues have source code and access to this ML
> > discussion to understand why this works this way.
>
> Exactly, so we can hopefully have a simpler interface.
>
> >> The same is going to apply for almost all of the rest of these
> >> configuration categories.
> >>
> >> I.e. a natural follow-up to e.g. batching across objects & index as I'm
> >> doing in
> >> https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/
> >> is to do likewise for all the PACK-related stuff before we rename it
> >> in-place. Or even have "git gc" issue only a single fsync() for all of
> >> PACKs, their metadata files, commit-graph etc., and then rename() things
> >> in-place as appropriate afterwards.
> >>
> >> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> >> index 365a12dc7ae..536238e209b 100644
> >> --- a/Documentation/config/core.txt
> >> +++ b/Documentation/config/core.txt
> >> @@ -548,49 +548,35 @@ core.whitespace::
> >>    errors. The default tab width is 8. Allowed values are 1 to 63.
> >>
> >>  core.fsync::
> >> -       A comma-separated list of components of the repository that
> >> -       should be hardened via the core.fsyncMethod when created or
> >> -       modified.  You can disable hardening of any component by
> >> -       prefixing it with a '-'.  Items that are not hardened may be
> >> -       lost in the event of an unclean system shutdown. Unless you
> >> -       have special requirements, it is recommended that you leave
> >> -       this option empty or pick one of `committed`, `added`,
> >> -       or `all`.
> >> -+
> >> -When this configuration is encountered, the set of components starts with
> >> -the platform default value, disabled components are removed, and additional
> >> -components are added. `none` resets the state so that the platform default
> >> -is ignored.
> >> -+
> >> -The empty string resets the fsync configuration to the platform
> >> -default. The default on most platforms is equivalent to
> >> -`core.fsync=committed,-loose-object`, which has good performance,
> >> -but risks losing recent work in the event of an unclean system shutdown.
> >> -+
> >> -* `none` clears the set of fsynced components.
> >> -* `loose-object` hardens objects added to the repo in loose-object form.
> >> -* `pack` hardens objects added to the repo in packfile form.
> >> -* `pack-metadata` hardens packfile bitmaps and indexes.
> >> -* `commit-graph` hardens the commit graph file.
> >> -* `index` hardens the index when it is modified.
> >> -* `objects` is an aggregate option that is equivalent to
> >> -  `loose-object,pack`.
> >> -* `derived-metadata` is an aggregate option that is equivalent to
> >> -  `pack-metadata,commit-graph`.
> >> -* `committed` is an aggregate option that is currently equivalent to
> >> -  `objects`. This mode sacrifices some performance to ensure that work
> >> -  that is committed to the repository with `git commit` or similar commands
> >> -  is hardened.
> >> -* `added` is an aggregate option that is currently equivalent to
> >> -  `committed,index`. This mode sacrifices additional performance to
> >> -  ensure that the results of commands like `git add` and similar operations
> >> -  are hardened.
> >> -* `all` is an aggregate option that syncs all individual components above.
> >> +       A boolen defaulting to `true`. To ensure data integrity git
> >> +       will fsync() its objects, index and refu updates etc. This can
> >> +       be set to `false` to disable `fsync()`-ing.
> >> ++
> >> +Only set this to `false` if you know what you're doing, and are
> >> +prepared to deal with data corruption. Valid use-cases include
> >> +throwaway uses of repositories on ramdisks, one-off mass-imports
> >> +followed by calling `sync(1)` etc.
> >> ++
> >> +Note that the syncing of loose objects is currently excluded from
> >> +`core.fsync=true`. To turn on all fsync-ing you'll need
> >> +`core.fsync=true` and `core.fsyncObjectFiles=true`, but see
> >> +`core.fsyncMethod=batch` below for a much faster alternative that's
> >> +just as safe on various modern OS's.
> >> ++
> >> +The default is in flux and may change in the future, in particular the
> >> +equivalent of the already-deprecated `core.fsyncObjectFiles` setting
> >> +might soon default to `true`, and `core.fsyncMethod`'s default of
> >> +`fsync` might default to a setting deemed to be safe on the local OS,
> >> +suc has `batch` or `writeout-only`
> >>
> >>  core.fsyncMethod::
> >>         A value indicating the strategy Git will use to harden repository data
> >>         using fsync and related primitives.
> >>  +
> >> +Defaults to `fsync`, but as discussed for `core.fsync` above might
> >> +change to use one of the values below taking advantage of
> >> +platform-specific "faster `fsync()`".
> >> ++
> >>  * `fsync` uses the fsync() system call or platform equivalents.
> >>  * `writeout-only` issues pagecache writeback requests, but depending on the
> >>    filesystem and storage hardware, data added to the repository may not be
> >> @@ -680,8 +666,8 @@ backed up by any standard (e.g. POSIX), but worked in practice on some
> >>  Linux setups.
> >>  +
> >>  Nowadays you should almost certainly want to use
> >> -`core.fsync=loose-object` instead in combination with
> >> -`core.fsyncMethod=bulk`, and possibly with
> >> +`core.fsync=true` instead in combination with
> >> +`core.fsyncMethod=batch`, and possibly with
> >>  `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
> >>  OSX, Windows) that gives you most of the performance benefit of
> >>  `core.fsyncObjectFiles=false` with all of the safety of the old
> >
> > I'm at the point where I don't want to endlessly revisit this discussion.
>
> Sorry, my intention isn't to frustrate you, but I do think it's
> important to get this right.
>
> Particularly since this is now in "next", and we're getting closer to a
> release. We can either talk about this now and decide on something, or
> it'll be in a release, and then publicly documented promises will be
> harder to back out of.
>
> I think your suggestion of just hiding the relevant documentation would
> be a good band-aid solution to that.
>
> But I also think that given how I was altering this in my RFC series
> that the premise of how this could be structured has been called into
> question in a way that we didn't (or I don't recall) us having discussed
> before.
>
> I.e. that we can say "sync loose, but not index", or "sync index, but
> not loose" with this config schema. When with "bulk" we it really isn't
> any more expensive to do both if one is true (even cheaper, actually).
>

I want to make a comment about the Index here.  Syncing the index is
strictly required for the "added" level of consistency, so that we
don't lose stuff that leaves the work tree but is staged.  But my
Windows enlistment has an index that's 266MB, which would be painful
to sync even with all the optimizations.  Maybe with split-index, this
wouldn't be so bad, but I just wanted to call out that some advanced
users may really care about the configurability.

As Git's various database implementations improve, the fsync stuff
will hopefully be more optimal and self-tuning.  But as that happens,
Git could just start ignoring settings that lose meaning without tying
anyones hands.
Ævar Arnfjörð Bjarmason March 26, 2022, 3:31 p.m. UTC | #5
On Fri, Mar 25 2022, Neeraj Singh wrote:

> On Fri, Mar 25, 2022 at 5:33 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Fri, Mar 25 2022, Neeraj Singh wrote:
>>
>> > On Wed, Mar 23, 2022 at 7:46 AM Ævar Arnfjörð Bjarmason
>> > <avarab@gmail.com> wrote:
>> >>
>> >>
>> >> On Tue, Mar 15 2022, Neeraj Singh wrote:
>> >>
>> >> I know this is probably 80% my fault by egging you on about initially
>> >> adding the wildmatch() based thing you didn't go for.
>> >>
>> >> But having looked at this with fresh eyes quite deeply I really think
>> >> we're severely over-configuring things here:
>> >>
>> >> > +core.fsync::
>> >> > +     A comma-separated list of components of the repository that
>> >> > +     should be hardened via the core.fsyncMethod when created or
>> >> > +     modified.  You can disable hardening of any component by
>> >> > +     prefixing it with a '-'.  Items that are not hardened may be
>> >> > +     lost in the event of an unclean system shutdown. Unless you
>> >> > +     have special requirements, it is recommended that you leave
>> >> > +     this option empty or pick one of `committed`, `added`,
>> >> > +     or `all`.
>> >> > ++
>> >> > +When this configuration is encountered, the set of components starts with
>> >> > +the platform default value, disabled components are removed, and additional
>> >> > +components are added. `none` resets the state so that the platform default
>> >> > +is ignored.
>> >> > ++
>> >> > +The empty string resets the fsync configuration to the platform
>> >> > +default. The default on most platforms is equivalent to
>> >> > +`core.fsync=committed,-loose-object`, which has good performance,
>> >> > +but risks losing recent work in the event of an unclean system shutdown.
>> >> > ++
>> >> > +* `none` clears the set of fsynced components.
>> >> > +* `loose-object` hardens objects added to the repo in loose-object form.
>> >> > +* `pack` hardens objects added to the repo in packfile form.
>> >> > +* `pack-metadata` hardens packfile bitmaps and indexes.
>> >> > +* `commit-graph` hardens the commit graph file.
>> >> > +* `index` hardens the index when it is modified.
>> >> > +* `objects` is an aggregate option that is equivalent to
>> >> > +  `loose-object,pack`.
>> >> > +* `derived-metadata` is an aggregate option that is equivalent to
>> >> > +  `pack-metadata,commit-graph`.
>> >> > +* `committed` is an aggregate option that is currently equivalent to
>> >> > +  `objects`. This mode sacrifices some performance to ensure that work
>> >> > +  that is committed to the repository with `git commit` or similar commands
>> >> > +  is hardened.
>> >> > +* `added` is an aggregate option that is currently equivalent to
>> >> > +  `committed,index`. This mode sacrifices additional performance to
>> >> > +  ensure that the results of commands like `git add` and similar operations
>> >> > +  are hardened.
>> >> > +* `all` is an aggregate option that syncs all individual components above.
>> >> > +
>> >> >  core.fsyncMethod::
>> >> >       A value indicating the strategy Git will use to harden repository data
>> >> >       using fsync and related primitives.
>> >>
>> >> On top of my
>> >> https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@gmail.com/
>> >> which makes the tmp-objdir part of your not-in-next-just-seen follow-up
>> >> series configurable via "fsyncMethod.batch.quarantine" I really think we
>> >> should just go for something like the belwo patch (note that
>> >> misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed
>> >> below.
>> >>
>> >> I.e. I think we should just do our default fsync() of everything, and
>> >> probably SOON make the fsync-ing of loose objects the default. Those who
>> >> care about performance will have "batch" (or "writeout-only"), which we
>> >> can have OS-specific detections for.
>> >>
>> >> But really, all of the rest of this is unduly boxing us into
>> >> overconfiguration that I think nobody really needs.
>> >>
>> >
>> > We've gone over this a few times already, but just wanted to state it
>> > again.  The really detailed settings are really there for Git hosters
>> > like GitLab or GitHub. I'd be happy to remove the per-component
>> > core.fsync values from the documentation and leave just the ones we
>> > point the user to.
>>
>> I'm prettty sure (but Patrick knows more) that GitLab's plan for this is
>> to keep it at whatever the safest setting is, presumably GitHub's as
>> well (but I don't know at all on that front).
>>
>> >> If someone really needs this level of detail they can LD_PRELOAD
>> >> something to have fsync intercept fd's and paths, and act appropriately.
>> >>
>> >> Worse, as the RFC series I sent
>> >> (https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@gmail.com/)
>> >> shows we can and should "batch" up fsync() operations across these
>> >> configuration boundaries, which this level of configuration would seem
>> >> to preclude.
>> >>
>> >> Or, we'd need to explain why "core.fsync=loose-object" won't *actually*
>> >> call fsync() on a single loose object's fd under "batch" as I had to do
>> >> on top of this in
>> >> https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@gmail.com/
>> >>
>> >
>> > 99.9% of users don't care and won't look.  The ones who do look deeper
>> > and understand the issues have source code and access to this ML
>> > discussion to understand why this works this way.
>>
>> Exactly, so we can hopefully have a simpler interface.
>>
>> >> The same is going to apply for almost all of the rest of these
>> >> configuration categories.
>> >>
>> >> I.e. a natural follow-up to e.g. batching across objects & index as I'm
>> >> doing in
>> >> https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/
>> >> is to do likewise for all the PACK-related stuff before we rename it
>> >> in-place. Or even have "git gc" issue only a single fsync() for all of
>> >> PACKs, their metadata files, commit-graph etc., and then rename() things
>> >> in-place as appropriate afterwards.
>> >>
>> >> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
>> >> index 365a12dc7ae..536238e209b 100644
>> >> --- a/Documentation/config/core.txt
>> >> +++ b/Documentation/config/core.txt
>> >> @@ -548,49 +548,35 @@ core.whitespace::
>> >>    errors. The default tab width is 8. Allowed values are 1 to 63.
>> >>
>> >>  core.fsync::
>> >> -       A comma-separated list of components of the repository that
>> >> -       should be hardened via the core.fsyncMethod when created or
>> >> -       modified.  You can disable hardening of any component by
>> >> -       prefixing it with a '-'.  Items that are not hardened may be
>> >> -       lost in the event of an unclean system shutdown. Unless you
>> >> -       have special requirements, it is recommended that you leave
>> >> -       this option empty or pick one of `committed`, `added`,
>> >> -       or `all`.
>> >> -+
>> >> -When this configuration is encountered, the set of components starts with
>> >> -the platform default value, disabled components are removed, and additional
>> >> -components are added. `none` resets the state so that the platform default
>> >> -is ignored.
>> >> -+
>> >> -The empty string resets the fsync configuration to the platform
>> >> -default. The default on most platforms is equivalent to
>> >> -`core.fsync=committed,-loose-object`, which has good performance,
>> >> -but risks losing recent work in the event of an unclean system shutdown.
>> >> -+
>> >> -* `none` clears the set of fsynced components.
>> >> -* `loose-object` hardens objects added to the repo in loose-object form.
>> >> -* `pack` hardens objects added to the repo in packfile form.
>> >> -* `pack-metadata` hardens packfile bitmaps and indexes.
>> >> -* `commit-graph` hardens the commit graph file.
>> >> -* `index` hardens the index when it is modified.
>> >> -* `objects` is an aggregate option that is equivalent to
>> >> -  `loose-object,pack`.
>> >> -* `derived-metadata` is an aggregate option that is equivalent to
>> >> -  `pack-metadata,commit-graph`.
>> >> -* `committed` is an aggregate option that is currently equivalent to
>> >> -  `objects`. This mode sacrifices some performance to ensure that work
>> >> -  that is committed to the repository with `git commit` or similar commands
>> >> -  is hardened.
>> >> -* `added` is an aggregate option that is currently equivalent to
>> >> -  `committed,index`. This mode sacrifices additional performance to
>> >> -  ensure that the results of commands like `git add` and similar operations
>> >> -  are hardened.
>> >> -* `all` is an aggregate option that syncs all individual components above.
>> >> +       A boolen defaulting to `true`. To ensure data integrity git
>> >> +       will fsync() its objects, index and refu updates etc. This can
>> >> +       be set to `false` to disable `fsync()`-ing.
>> >> ++
>> >> +Only set this to `false` if you know what you're doing, and are
>> >> +prepared to deal with data corruption. Valid use-cases include
>> >> +throwaway uses of repositories on ramdisks, one-off mass-imports
>> >> +followed by calling `sync(1)` etc.
>> >> ++
>> >> +Note that the syncing of loose objects is currently excluded from
>> >> +`core.fsync=true`. To turn on all fsync-ing you'll need
>> >> +`core.fsync=true` and `core.fsyncObjectFiles=true`, but see
>> >> +`core.fsyncMethod=batch` below for a much faster alternative that's
>> >> +just as safe on various modern OS's.
>> >> ++
>> >> +The default is in flux and may change in the future, in particular the
>> >> +equivalent of the already-deprecated `core.fsyncObjectFiles` setting
>> >> +might soon default to `true`, and `core.fsyncMethod`'s default of
>> >> +`fsync` might default to a setting deemed to be safe on the local OS,
>> >> +suc has `batch` or `writeout-only`
>> >>
>> >>  core.fsyncMethod::
>> >>         A value indicating the strategy Git will use to harden repository data
>> >>         using fsync and related primitives.
>> >>  +
>> >> +Defaults to `fsync`, but as discussed for `core.fsync` above might
>> >> +change to use one of the values below taking advantage of
>> >> +platform-specific "faster `fsync()`".
>> >> ++
>> >>  * `fsync` uses the fsync() system call or platform equivalents.
>> >>  * `writeout-only` issues pagecache writeback requests, but depending on the
>> >>    filesystem and storage hardware, data added to the repository may not be
>> >> @@ -680,8 +666,8 @@ backed up by any standard (e.g. POSIX), but worked in practice on some
>> >>  Linux setups.
>> >>  +
>> >>  Nowadays you should almost certainly want to use
>> >> -`core.fsync=loose-object` instead in combination with
>> >> -`core.fsyncMethod=bulk`, and possibly with
>> >> +`core.fsync=true` instead in combination with
>> >> +`core.fsyncMethod=batch`, and possibly with
>> >>  `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
>> >>  OSX, Windows) that gives you most of the performance benefit of
>> >>  `core.fsyncObjectFiles=false` with all of the safety of the old
>> >
>> > I'm at the point where I don't want to endlessly revisit this discussion.
>>
>> Sorry, my intention isn't to frustrate you, but I do think it's
>> important to get this right.
>>
>> Particularly since this is now in "next", and we're getting closer to a
>> release. We can either talk about this now and decide on something, or
>> it'll be in a release, and then publicly documented promises will be
>> harder to back out of.
>>
>> I think your suggestion of just hiding the relevant documentation would
>> be a good band-aid solution to that.
>>
>> But I also think that given how I was altering this in my RFC series
>> that the premise of how this could be structured has been called into
>> question in a way that we didn't (or I don't recall) us having discussed
>> before.
>>
>> I.e. that we can say "sync loose, but not index", or "sync index, but
>> not loose" with this config schema. When with "bulk" we it really isn't
>> any more expensive to do both if one is true (even cheaper, actually).
>>
>
> I want to make a comment about the Index here.  Syncing the index is
> strictly required for the "added" level of consistency, so that we
> don't lose stuff that leaves the work tree but is staged.  But my
> Windows enlistment has an index that's 266MB, which would be painful
> to sync even with all the optimizations.  Maybe with split-index, this
> wouldn't be so bad, but I just wanted to call out that some advanced
> users may really care about the configurability.

So for that use-case you'd like to fsync the loose objects (if any), but
not the index? So the FS will "flush" up to the index, and then queue
the index for later syncing to platter?


But even in that case don't the settings need to be tied to one another,
because in the method=bulk sync=index && sync=!loose case wouldn't we be
syncing "loose" in any case?

> As Git's various database implementations improve, the fsync stuff
> will hopefully be more optimal and self-tuning.  But as that happens,
> Git could just start ignoring settings that lose meaning without tying
> anyones hands.

Yeah that would alleviate most of my concerns here, but the docs aren't
saying anything like that. Since you added them & they just landed, do
you mind doing a small follow-up where we e.g. say that these new
settings are "EXPERIMENTAL" or whatever, and subject to drastic change?
Neeraj Singh March 27, 2022, 5:27 a.m. UTC | #6
On Sat, Mar 26, 2022 at 8:34 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Fri, Mar 25 2022, Neeraj Singh wrote:
>
> > On Fri, Mar 25, 2022 at 5:33 PM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>
> >> On Fri, Mar 25 2022, Neeraj Singh wrote:
> >>
> >> > On Wed, Mar 23, 2022 at 7:46 AM Ævar Arnfjörð Bjarmason
> >> > <avarab@gmail.com> wrote:
> >> >>
> >> >>
> >> >> On Tue, Mar 15 2022, Neeraj Singh wrote:
> >> >>
> >> >> I know this is probably 80% my fault by egging you on about initially
> >> >> adding the wildmatch() based thing you didn't go for.
> >> >>
> >> >> But having looked at this with fresh eyes quite deeply I really think
> >> >> we're severely over-configuring things here:
> >> >>
> >> >> > +core.fsync::
> >> >> > +     A comma-separated list of components of the repository that
> >> >> > +     should be hardened via the core.fsyncMethod when created or
> >> >> > +     modified.  You can disable hardening of any component by
> >> >> > +     prefixing it with a '-'.  Items that are not hardened may be
> >> >> > +     lost in the event of an unclean system shutdown. Unless you
> >> >> > +     have special requirements, it is recommended that you leave
> >> >> > +     this option empty or pick one of `committed`, `added`,
> >> >> > +     or `all`.
> >> >> > ++
> >> >> > +When this configuration is encountered, the set of components starts with
> >> >> > +the platform default value, disabled components are removed, and additional
> >> >> > +components are added. `none` resets the state so that the platform default
> >> >> > +is ignored.
> >> >> > ++
> >> >> > +The empty string resets the fsync configuration to the platform
> >> >> > +default. The default on most platforms is equivalent to
> >> >> > +`core.fsync=committed,-loose-object`, which has good performance,
> >> >> > +but risks losing recent work in the event of an unclean system shutdown.
> >> >> > ++
> >> >> > +* `none` clears the set of fsynced components.
> >> >> > +* `loose-object` hardens objects added to the repo in loose-object form.
> >> >> > +* `pack` hardens objects added to the repo in packfile form.
> >> >> > +* `pack-metadata` hardens packfile bitmaps and indexes.
> >> >> > +* `commit-graph` hardens the commit graph file.
> >> >> > +* `index` hardens the index when it is modified.
> >> >> > +* `objects` is an aggregate option that is equivalent to
> >> >> > +  `loose-object,pack`.
> >> >> > +* `derived-metadata` is an aggregate option that is equivalent to
> >> >> > +  `pack-metadata,commit-graph`.
> >> >> > +* `committed` is an aggregate option that is currently equivalent to
> >> >> > +  `objects`. This mode sacrifices some performance to ensure that work
> >> >> > +  that is committed to the repository with `git commit` or similar commands
> >> >> > +  is hardened.
> >> >> > +* `added` is an aggregate option that is currently equivalent to
> >> >> > +  `committed,index`. This mode sacrifices additional performance to
> >> >> > +  ensure that the results of commands like `git add` and similar operations
> >> >> > +  are hardened.
> >> >> > +* `all` is an aggregate option that syncs all individual components above.
> >> >> > +
> >> >> >  core.fsyncMethod::
> >> >> >       A value indicating the strategy Git will use to harden repository data
> >> >> >       using fsync and related primitives.
> >> >>
> >> >> On top of my
> >> >> https://lore.kernel.org/git/RFC-patch-v2-7.7-a5951366c6e-20220323T140753Z-avarab@gmail.com/
> >> >> which makes the tmp-objdir part of your not-in-next-just-seen follow-up
> >> >> series configurable via "fsyncMethod.batch.quarantine" I really think we
> >> >> should just go for something like the belwo patch (note that
> >> >> misspelled/mistook "bulk" for "batch" in that linked-t patch, fixed
> >> >> below.
> >> >>
> >> >> I.e. I think we should just do our default fsync() of everything, and
> >> >> probably SOON make the fsync-ing of loose objects the default. Those who
> >> >> care about performance will have "batch" (or "writeout-only"), which we
> >> >> can have OS-specific detections for.
> >> >>
> >> >> But really, all of the rest of this is unduly boxing us into
> >> >> overconfiguration that I think nobody really needs.
> >> >>
> >> >
> >> > We've gone over this a few times already, but just wanted to state it
> >> > again.  The really detailed settings are really there for Git hosters
> >> > like GitLab or GitHub. I'd be happy to remove the per-component
> >> > core.fsync values from the documentation and leave just the ones we
> >> > point the user to.
> >>
> >> I'm prettty sure (but Patrick knows more) that GitLab's plan for this is
> >> to keep it at whatever the safest setting is, presumably GitHub's as
> >> well (but I don't know at all on that front).
> >>
> >> >> If someone really needs this level of detail they can LD_PRELOAD
> >> >> something to have fsync intercept fd's and paths, and act appropriately.
> >> >>
> >> >> Worse, as the RFC series I sent
> >> >> (https://lore.kernel.org/git/RFC-cover-v2-0.7-00000000000-20220323T140753Z-avarab@gmail.com/)
> >> >> shows we can and should "batch" up fsync() operations across these
> >> >> configuration boundaries, which this level of configuration would seem
> >> >> to preclude.
> >> >>
> >> >> Or, we'd need to explain why "core.fsync=loose-object" won't *actually*
> >> >> call fsync() on a single loose object's fd under "batch" as I had to do
> >> >> on top of this in
> >> >> https://lore.kernel.org/git/RFC-patch-v2-6.7-c20301d7967-20220323T140753Z-avarab@gmail.com/
> >> >>
> >> >
> >> > 99.9% of users don't care and won't look.  The ones who do look deeper
> >> > and understand the issues have source code and access to this ML
> >> > discussion to understand why this works this way.
> >>
> >> Exactly, so we can hopefully have a simpler interface.
> >>
> >> >> The same is going to apply for almost all of the rest of these
> >> >> configuration categories.
> >> >>
> >> >> I.e. a natural follow-up to e.g. batching across objects & index as I'm
> >> >> doing in
> >> >> https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/
> >> >> is to do likewise for all the PACK-related stuff before we rename it
> >> >> in-place. Or even have "git gc" issue only a single fsync() for all of
> >> >> PACKs, their metadata files, commit-graph etc., and then rename() things
> >> >> in-place as appropriate afterwards.
> >> >>
> >> >> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> >> >> index 365a12dc7ae..536238e209b 100644
> >> >> --- a/Documentation/config/core.txt
> >> >> +++ b/Documentation/config/core.txt
> >> >> @@ -548,49 +548,35 @@ core.whitespace::
> >> >>    errors. The default tab width is 8. Allowed values are 1 to 63.
> >> >>
> >> >>  core.fsync::
> >> >> -       A comma-separated list of components of the repository that
> >> >> -       should be hardened via the core.fsyncMethod when created or
> >> >> -       modified.  You can disable hardening of any component by
> >> >> -       prefixing it with a '-'.  Items that are not hardened may be
> >> >> -       lost in the event of an unclean system shutdown. Unless you
> >> >> -       have special requirements, it is recommended that you leave
> >> >> -       this option empty or pick one of `committed`, `added`,
> >> >> -       or `all`.
> >> >> -+
> >> >> -When this configuration is encountered, the set of components starts with
> >> >> -the platform default value, disabled components are removed, and additional
> >> >> -components are added. `none` resets the state so that the platform default
> >> >> -is ignored.
> >> >> -+
> >> >> -The empty string resets the fsync configuration to the platform
> >> >> -default. The default on most platforms is equivalent to
> >> >> -`core.fsync=committed,-loose-object`, which has good performance,
> >> >> -but risks losing recent work in the event of an unclean system shutdown.
> >> >> -+
> >> >> -* `none` clears the set of fsynced components.
> >> >> -* `loose-object` hardens objects added to the repo in loose-object form.
> >> >> -* `pack` hardens objects added to the repo in packfile form.
> >> >> -* `pack-metadata` hardens packfile bitmaps and indexes.
> >> >> -* `commit-graph` hardens the commit graph file.
> >> >> -* `index` hardens the index when it is modified.
> >> >> -* `objects` is an aggregate option that is equivalent to
> >> >> -  `loose-object,pack`.
> >> >> -* `derived-metadata` is an aggregate option that is equivalent to
> >> >> -  `pack-metadata,commit-graph`.
> >> >> -* `committed` is an aggregate option that is currently equivalent to
> >> >> -  `objects`. This mode sacrifices some performance to ensure that work
> >> >> -  that is committed to the repository with `git commit` or similar commands
> >> >> -  is hardened.
> >> >> -* `added` is an aggregate option that is currently equivalent to
> >> >> -  `committed,index`. This mode sacrifices additional performance to
> >> >> -  ensure that the results of commands like `git add` and similar operations
> >> >> -  are hardened.
> >> >> -* `all` is an aggregate option that syncs all individual components above.
> >> >> +       A boolen defaulting to `true`. To ensure data integrity git
> >> >> +       will fsync() its objects, index and refu updates etc. This can
> >> >> +       be set to `false` to disable `fsync()`-ing.
> >> >> ++
> >> >> +Only set this to `false` if you know what you're doing, and are
> >> >> +prepared to deal with data corruption. Valid use-cases include
> >> >> +throwaway uses of repositories on ramdisks, one-off mass-imports
> >> >> +followed by calling `sync(1)` etc.
> >> >> ++
> >> >> +Note that the syncing of loose objects is currently excluded from
> >> >> +`core.fsync=true`. To turn on all fsync-ing you'll need
> >> >> +`core.fsync=true` and `core.fsyncObjectFiles=true`, but see
> >> >> +`core.fsyncMethod=batch` below for a much faster alternative that's
> >> >> +just as safe on various modern OS's.
> >> >> ++
> >> >> +The default is in flux and may change in the future, in particular the
> >> >> +equivalent of the already-deprecated `core.fsyncObjectFiles` setting
> >> >> +might soon default to `true`, and `core.fsyncMethod`'s default of
> >> >> +`fsync` might default to a setting deemed to be safe on the local OS,
> >> >> +suc has `batch` or `writeout-only`
> >> >>
> >> >>  core.fsyncMethod::
> >> >>         A value indicating the strategy Git will use to harden repository data
> >> >>         using fsync and related primitives.
> >> >>  +
> >> >> +Defaults to `fsync`, but as discussed for `core.fsync` above might
> >> >> +change to use one of the values below taking advantage of
> >> >> +platform-specific "faster `fsync()`".
> >> >> ++
> >> >>  * `fsync` uses the fsync() system call or platform equivalents.
> >> >>  * `writeout-only` issues pagecache writeback requests, but depending on the
> >> >>    filesystem and storage hardware, data added to the repository may not be
> >> >> @@ -680,8 +666,8 @@ backed up by any standard (e.g. POSIX), but worked in practice on some
> >> >>  Linux setups.
> >> >>  +
> >> >>  Nowadays you should almost certainly want to use
> >> >> -`core.fsync=loose-object` instead in combination with
> >> >> -`core.fsyncMethod=bulk`, and possibly with
> >> >> +`core.fsync=true` instead in combination with
> >> >> +`core.fsyncMethod=batch`, and possibly with
> >> >>  `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
> >> >>  OSX, Windows) that gives you most of the performance benefit of
> >> >>  `core.fsyncObjectFiles=false` with all of the safety of the old
> >> >
> >> > I'm at the point where I don't want to endlessly revisit this discussion.
> >>
> >> Sorry, my intention isn't to frustrate you, but I do think it's
> >> important to get this right.
> >>
> >> Particularly since this is now in "next", and we're getting closer to a
> >> release. We can either talk about this now and decide on something, or
> >> it'll be in a release, and then publicly documented promises will be
> >> harder to back out of.
> >>
> >> I think your suggestion of just hiding the relevant documentation would
> >> be a good band-aid solution to that.
> >>
> >> But I also think that given how I was altering this in my RFC series
> >> that the premise of how this could be structured has been called into
> >> question in a way that we didn't (or I don't recall) us having discussed
> >> before.
> >>
> >> I.e. that we can say "sync loose, but not index", or "sync index, but
> >> not loose" with this config schema. When with "bulk" we it really isn't
> >> any more expensive to do both if one is true (even cheaper, actually).
> >>
> >
> > I want to make a comment about the Index here.  Syncing the index is
> > strictly required for the "added" level of consistency, so that we
> > don't lose stuff that leaves the work tree but is staged.  But my
> > Windows enlistment has an index that's 266MB, which would be painful
> > to sync even with all the optimizations.  Maybe with split-index, this
> > wouldn't be so bad, but I just wanted to call out that some advanced
> > users may really care about the configurability.
>
> So for that use-case you'd like to fsync the loose objects (if any), but
> not the index? So the FS will "flush" up to the index, and then queue
> the index for later syncing to platter?
>
>
> But even in that case don't the settings need to be tied to one another,
> because in the method=bulk sync=index && sync=!loose case wouldn't we be
> syncing "loose" in any case?
>
> > As Git's various database implementations improve, the fsync stuff
> > will hopefully be more optimal and self-tuning.  But as that happens,
> > Git could just start ignoring settings that lose meaning without tying
> > anyones hands.
>
> Yeah that would alleviate most of my concerns here, but the docs aren't
> saying anything like that. Since you added them & they just landed, do
> you mind doing a small follow-up where we e.g. say that these new
> settings are "EXPERIMENTAL" or whatever, and subject to drastic change?

The doc is already pretty prescriptive.  It has this line at the end
of the first  paragraph:
"Unless you
have special requirements, it is recommended that you leave
this option empty or pick one of `committed`, `added`,
or `all`."

Those values are already designed to change as Git changes.
Ævar Arnfjörð Bjarmason March 27, 2022, 12:43 p.m. UTC | #7
On Sat, Mar 26 2022, Neeraj Singh wrote:

> On Sat, Mar 26, 2022 at 8:34 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Fri, Mar 25 2022, Neeraj Singh wrote:
>>
>> > On Fri, Mar 25, 2022 at 5:33 PM Ævar Arnfjörð Bjarmason
>> > <avarab@gmail.com> wrote:
[...]
>> > I want to make a comment about the Index here.  Syncing the index is
>> > strictly required for the "added" level of consistency, so that we
>> > don't lose stuff that leaves the work tree but is staged.  But my
>> > Windows enlistment has an index that's 266MB, which would be painful
>> > to sync even with all the optimizations.  Maybe with split-index, this
>> > wouldn't be so bad, but I just wanted to call out that some advanced
>> > users may really care about the configurability.
>>
>> So for that use-case you'd like to fsync the loose objects (if any), but
>> not the index? So the FS will "flush" up to the index, and then queue
>> the index for later syncing to platter?
>>
>>
>> But even in that case don't the settings need to be tied to one another,
>> because in the method=bulk sync=index && sync=!loose case wouldn't we be
>> syncing "loose" in any case?
>>
>> > As Git's various database implementations improve, the fsync stuff
>> > will hopefully be more optimal and self-tuning.  But as that happens,
>> > Git could just start ignoring settings that lose meaning without tying
>> > anyones hands.
>>
>> Yeah that would alleviate most of my concerns here, but the docs aren't
>> saying anything like that. Since you added them & they just landed, do
>> you mind doing a small follow-up where we e.g. say that these new
>> settings are "EXPERIMENTAL" or whatever, and subject to drastic change?
>
> The doc is already pretty prescriptive.  It has this line at the end
> of the first  paragraph:
> "Unless you
> have special requirements, it is recommended that you leave
> this option empty or pick one of `committed`, `added`,
> or `all`."
>
> Those values are already designed to change as Git changes.

I'm referring to the documentation as it stands not being marked as
experimental in the sense that we might decide to re-do this to a large
extent, i.e. something like the diff I suggested upthread in
https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/

So yes, I agree that it e.g. clearly states that you can add a new
core.git=foobar or whatever down the line, but it clearly doesn't
suggest that e.g. core.fsync might have boolean semantics in some later
version, or that the rest might simply be ignored, even if that
e.g. means that we wouldn't sync loose objects on
core.fsync=loose-object, as we'd just warn with a "we don't provide this
anymore".

Or do you disagree with that? IOW I mean that we'd do something like
this, either in docs or code:

diff --git a/config.c b/config.c
index 3c9b6b589ab..94548566073 100644
--- a/config.c
+++ b/config.c
@@ -1675,6 +1675,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 	}
 
 	if (!strcmp(var, "core.fsync")) {
+		if (!the_repository->settings.feature_experimental)
+			warning(_("the '%s' configuration option is EXPERIMENTAL. opt-in to use it with feature.experimental=true"),
+				var);
 		if (!value)
 			return config_error_nonbool(var);
 		fsync_components = parse_fsync_components(var, value);
@@ -1682,6 +1685,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 	}
 
 	if (!strcmp(var, "core.fsyncmethod")) {
+		if (!the_repository->settings.feature_experimental)
+			warning(_("the '%s' configuration option is EXPERIMENTAL. opt-in to use it with feature.experimental=true"),
+				var);
 		if (!value)
 			return config_error_nonbool(var);
 		if (!strcmp(value, "fsync"))
diff --git a/repo-settings.c b/repo-settings.c
index b4fbd16cdcc..f949b65b91e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -31,6 +31,7 @@ void prepare_repo_settings(struct repository *r)
 	/* Booleans config or default, cascades to other settings */
 	repo_cfg_bool(r, "feature.manyfiles", &manyfiles, 0);
 	repo_cfg_bool(r, "feature.experimental", &experimental, 0);
+	r->settings.feature_experimental = experimental;
 
 	/* Defaults modified by feature.* */
 	if (experimental) {
diff --git a/repository.h b/repository.h
index e29f361703d..db8f99a8989 100644
--- a/repository.h
+++ b/repository.h
@@ -28,6 +28,7 @@ enum fetch_negotiation_setting {
 struct repo_settings {
 	int initialized;
 
+	int feature_experimental;
 	int core_commit_graph;
 	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
Patrick Steinhardt March 28, 2022, 10:56 a.m. UTC | #8
On Sun, Mar 27, 2022 at 02:43:48PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sat, Mar 26 2022, Neeraj Singh wrote:
> 
> > On Sat, Mar 26, 2022 at 8:34 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>
> >> On Fri, Mar 25 2022, Neeraj Singh wrote:
> >>
> >> > On Fri, Mar 25, 2022 at 5:33 PM Ævar Arnfjörð Bjarmason
> >> > <avarab@gmail.com> wrote:
> [...]
> >> > I want to make a comment about the Index here.  Syncing the index is
> >> > strictly required for the "added" level of consistency, so that we
> >> > don't lose stuff that leaves the work tree but is staged.  But my
> >> > Windows enlistment has an index that's 266MB, which would be painful
> >> > to sync even with all the optimizations.  Maybe with split-index, this
> >> > wouldn't be so bad, but I just wanted to call out that some advanced
> >> > users may really care about the configurability.
> >>
> >> So for that use-case you'd like to fsync the loose objects (if any), but
> >> not the index? So the FS will "flush" up to the index, and then queue
> >> the index for later syncing to platter?
> >>
> >>
> >> But even in that case don't the settings need to be tied to one another,
> >> because in the method=bulk sync=index && sync=!loose case wouldn't we be
> >> syncing "loose" in any case?
> >>
> >> > As Git's various database implementations improve, the fsync stuff
> >> > will hopefully be more optimal and self-tuning.  But as that happens,
> >> > Git could just start ignoring settings that lose meaning without tying
> >> > anyones hands.
> >>
> >> Yeah that would alleviate most of my concerns here, but the docs aren't
> >> saying anything like that. Since you added them & they just landed, do
> >> you mind doing a small follow-up where we e.g. say that these new
> >> settings are "EXPERIMENTAL" or whatever, and subject to drastic change?
> >
> > The doc is already pretty prescriptive.  It has this line at the end
> > of the first  paragraph:
> > "Unless you
> > have special requirements, it is recommended that you leave
> > this option empty or pick one of `committed`, `added`,
> > or `all`."
> >
> > Those values are already designed to change as Git changes.
> 
> I'm referring to the documentation as it stands not being marked as
> experimental in the sense that we might decide to re-do this to a large
> extent, i.e. something like the diff I suggested upthread in
> https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/
> 
> So yes, I agree that it e.g. clearly states that you can add a new
> core.git=foobar or whatever down the line, but it clearly doesn't
> suggest that e.g. core.fsync might have boolean semantics in some later
> version, or that the rest might simply be ignored, even if that
> e.g. means that we wouldn't sync loose objects on
> core.fsync=loose-object, as we'd just warn with a "we don't provide this
> anymore".
> 
> Or do you disagree with that? IOW I mean that we'd do something like
> this, either in docs or code:
> 
> diff --git a/config.c b/config.c
> index 3c9b6b589ab..94548566073 100644
> --- a/config.c
> +++ b/config.c
> @@ -1675,6 +1675,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>  	}
>  
>  	if (!strcmp(var, "core.fsync")) {
> +		if (!the_repository->settings.feature_experimental)
> +			warning(_("the '%s' configuration option is EXPERIMENTAL. opt-in to use it with feature.experimental=true"),
> +				var);
>  		if (!value)
>  			return config_error_nonbool(var);
>  		fsync_components = parse_fsync_components(var, value);
> @@ -1682,6 +1685,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>  	}
>  
>  	if (!strcmp(var, "core.fsyncmethod")) {
> +		if (!the_repository->settings.feature_experimental)
> +			warning(_("the '%s' configuration option is EXPERIMENTAL. opt-in to use it with feature.experimental=true"),
> +				var);
>  		if (!value)
>  			return config_error_nonbool(var);
>  		if (!strcmp(value, "fsync"))

Let's please not tie this to `feature.experimental=true`. Setting that
option has unintended sideeffects and will also change defaults which we
may not want to have in production. I don't mind adding a warning in the
docs though that the specific items which can be configured may be
subject to change in the future.

At GitLab, we've got a three-step plan:

    1. We need to migrate to `core.fsync` in the first place. In order
       to not migrate and change behaviour at the same point in time we
       already benefit from the fine-grainedness of this config because
       we can simply say `core.fsync=loose-objects` and have the same
       behaviour as before with `core.fsyncLooseObjects=true`.

    2. We'll want to enable syncing of packfiles, which I think wasn't
       previously covered by `core.fsyncLooseobjects`.

    3. We'll add `refs` to also sync loose refs to disk.

So while the end result will be the same as `committed`, having this
level of control helps us to assess the impact in a nicer way by being
able to do this step by step with feature flags.

On the other hand, many of the other parts we don't really care about.
Auxiliary metadata like the commit-graph or pack indices are data that
can in the worst case be regenerated by us, so it's not clear to me
whether it makes to also enable fsyncing those in production.

So altogether, I agree with Neeraj: having the fine-grainedness greatly
helps us to roll out changes like this and be able to pick what we deem
to be important. Personally I would be fine with explicitly pointing out
that there are two groups of this config in our docs though:

    1. The "porcelain" group: "committed", "added", "all", "none". These
       are abstract groups whose behaviour should adapt as we change
       implementations, and are those that should typically be set by a
       user, if intended.

    2. The "plumbing" or "expert" group: these are fine-grained options
       which shouldn't typically be used by Git users. They still have
       merit though in hosting environments, where requirements are
       typically a lot more specific.

We may also provide different guarantees for both groups. The first one
should definitely be stable, but we might state that the second group is
subject to change in the future.

Patrick
Ævar Arnfjörð Bjarmason March 28, 2022, 11:25 a.m. UTC | #9
On Mon, Mar 28 2022, Patrick Steinhardt wrote:

> [[PGP Signed Part:Undecided]]
> On Sun, Mar 27, 2022 at 02:43:48PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Sat, Mar 26 2022, Neeraj Singh wrote:
>> 
>> > On Sat, Mar 26, 2022 at 8:34 AM Ævar Arnfjörð Bjarmason
>> > <avarab@gmail.com> wrote:
>> >>
>> >>
>> >> On Fri, Mar 25 2022, Neeraj Singh wrote:
>> >>
>> >> > On Fri, Mar 25, 2022 at 5:33 PM Ævar Arnfjörð Bjarmason
>> >> > <avarab@gmail.com> wrote:
>> [...]
>> >> > I want to make a comment about the Index here.  Syncing the index is
>> >> > strictly required for the "added" level of consistency, so that we
>> >> > don't lose stuff that leaves the work tree but is staged.  But my
>> >> > Windows enlistment has an index that's 266MB, which would be painful
>> >> > to sync even with all the optimizations.  Maybe with split-index, this
>> >> > wouldn't be so bad, but I just wanted to call out that some advanced
>> >> > users may really care about the configurability.
>> >>
>> >> So for that use-case you'd like to fsync the loose objects (if any), but
>> >> not the index? So the FS will "flush" up to the index, and then queue
>> >> the index for later syncing to platter?
>> >>
>> >>
>> >> But even in that case don't the settings need to be tied to one another,
>> >> because in the method=bulk sync=index && sync=!loose case wouldn't we be
>> >> syncing "loose" in any case?
>> >>
>> >> > As Git's various database implementations improve, the fsync stuff
>> >> > will hopefully be more optimal and self-tuning.  But as that happens,
>> >> > Git could just start ignoring settings that lose meaning without tying
>> >> > anyones hands.
>> >>
>> >> Yeah that would alleviate most of my concerns here, but the docs aren't
>> >> saying anything like that. Since you added them & they just landed, do
>> >> you mind doing a small follow-up where we e.g. say that these new
>> >> settings are "EXPERIMENTAL" or whatever, and subject to drastic change?
>> >
>> > The doc is already pretty prescriptive.  It has this line at the end
>> > of the first  paragraph:
>> > "Unless you
>> > have special requirements, it is recommended that you leave
>> > this option empty or pick one of `committed`, `added`,
>> > or `all`."
>> >
>> > Those values are already designed to change as Git changes.
>> 
>> I'm referring to the documentation as it stands not being marked as
>> experimental in the sense that we might decide to re-do this to a large
>> extent, i.e. something like the diff I suggested upthread in
>> https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/
>> 
>> So yes, I agree that it e.g. clearly states that you can add a new
>> core.git=foobar or whatever down the line, but it clearly doesn't
>> suggest that e.g. core.fsync might have boolean semantics in some later
>> version, or that the rest might simply be ignored, even if that
>> e.g. means that we wouldn't sync loose objects on
>> core.fsync=loose-object, as we'd just warn with a "we don't provide this
>> anymore".
>> 
>> Or do you disagree with that? IOW I mean that we'd do something like
>> this, either in docs or code:
>> 
>> diff --git a/config.c b/config.c
>> index 3c9b6b589ab..94548566073 100644
>> --- a/config.c
>> +++ b/config.c
>> @@ -1675,6 +1675,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>>  	}
>>  
>>  	if (!strcmp(var, "core.fsync")) {
>> +		if (!the_repository->settings.feature_experimental)
>> +			warning(_("the '%s' configuration option is EXPERIMENTAL. opt-in to use it with feature.experimental=true"),
>> +				var);
>>  		if (!value)
>>  			return config_error_nonbool(var);
>>  		fsync_components = parse_fsync_components(var, value);
>> @@ -1682,6 +1685,9 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>>  	}
>>  
>>  	if (!strcmp(var, "core.fsyncmethod")) {
>> +		if (!the_repository->settings.feature_experimental)
>> +			warning(_("the '%s' configuration option is EXPERIMENTAL. opt-in to use it with feature.experimental=true"),
>> +				var);
>>  		if (!value)
>>  			return config_error_nonbool(var);
>>  		if (!strcmp(value, "fsync"))
>
> Let's please not tie this to `feature.experimental=true`. Setting that
> option has unintended sideeffects and will also change defaults which we
> may not want to have in production. I don't mind adding a warning in the
> docs though that the specific items which can be configured may be
> subject to change in the future.

Yes, that was a bad (throwaway) idea. I think probably any sort of
warning is over-doing it, but having the same in the docs would be good,
as in:

    git gre EXPERIMENTAL -- Documentation

> At GitLab, we've got a three-step plan:
>
>     1. We need to migrate to `core.fsync` in the first place. In order
>        to not migrate and change behaviour at the same point in time we
>        already benefit from the fine-grainedness of this config because
>        we can simply say `core.fsync=loose-objects` and have the same
>        behaviour as before with `core.fsyncLooseObjects=true`.

*nod*.

>     2. We'll want to enable syncing of packfiles, which I think wasn't
>        previously covered by `core.fsyncLooseobjects`.

We've always fsynced packfiles and other things that use the
finalize_hashfile() API. I.e. the pack metadata (idx,midx,bitmap etc.),
commit-graph etc.

Which is one thing I find a bit uncomfortable about the proposed config
schema, i.e. it's allowing *granular* unsafe behavior we didn't allow
before.

I think we *should* allow disabling fsync() entirely via:

    core.fsync=false

Per Eric Wong's [added to CC] proposal here:
https://lore.kernel.org/git/20211028002102.19384-1-e@80x24.org/; I think
that's useful for e.g. running git in CI, one off scripted mass-imports
where you run "sync(1)" after (or not...).

But I don't really see the use-case for turning off say "index" or
"pack-metadata", but otherwise keeping the default fsync().

>     3. We'll add `refs` to also sync loose refs to disk.

Maybe I'm reading this wrong, but AFAICT if you upgrade from pre-v2.36.0
to v2.36.0 you'll have no way to do fsync()-ing as it was done before
with your bc22d845c43 (core.fsync: new option to harden references,
2022-03-11).

I.e. I first thought you meant to start with:

    core.fsync=-loose-objects
    core.fsync=-reference

And then remove that "core.fsync=-reference" line to get the behavior
that'll be new in v2.36.0, but that won't do that. The new "reference"
category doesn't just affect loose refs, but all ref updates.

So we don't have any way to move to v2.36.0 and get exactly the fsync()
behavior we did before, or have I misread the code?

Now, I don't think we need it to be configurable at all.

I.e. I think your is a bc22d845c43 good change, but it feels weird to
make it and leave the default of core.fsyncLooseObjects=false on the
table. I.e. what you're summarizing there is true, but it's also true of
the loose objects.

To the extent that we've had any reason at all not to sync those by
default (which has really mostly been "we didn't re-visit it for a
while") it's been performance.

And both loose refs & loose objects will suffer from the same
degradation in performance from many fsync()'s.

IOW I think it's perfectly fine not to add a config knob for it other
than core.fsync=false, and VERY SOON turn on
core.fsyncLooseobjects=true, especially if we can get most of the
performance benefits with the "bulk" mode.

But why half-way with bc22d845c43? I mean, *that change* should be
narrow, but in terms of where we go next what do you think of the above?

> So while the end result will be the same as `committed`, having this
> level of control helps us to assess the impact in a nicer way by being
> able to do this step by step with feature flags.

*Nod*, although leaving aside the new syncing of loose refs the plan you
 outlined above could be done with the proposal of the simpler:

    core.fsync=true
    core.fsyncLooseObjects=false

> On the other hand, many of the other parts we don't really care about.
> Auxiliary metadata like the commit-graph or pack indices are data that
> can in the worst case be regenerated by us, so it's not clear to me
> whether it makes to also enable fsyncing those in production.

I'm not familiar with all of those in detail, i.e. how we behave
specifically in the face of them being corrupt.

The commit-graph I am, we *should* recover "gracefully" there, but you
might have an incident shortly there after due to "for-each-ref
--contains" slowdowns by 1000x or whatever.

For e.g. *.idx we'd be hosed until manual recovery.

So I think those area all in the same bucket as core.fsync=false, and
that we don't need the granularity.

> So altogether, I agree with Neeraj: having the fine-grainedness greatly
> helps us to roll out changes like this and be able to pick what we deem
> to be important. Personally I would be fine with explicitly pointing out
> that there are two groups of this config in our docs though:

Yes, that's fair. But pending replies to the above I think the main
point & proposal of us having too much config stands. I.e. depending on
what you want to do with loose object refs we'd just need this:

    core.fsync=[<bool>] # true by default
    core.fsyncLooseObjects=[<bool>] # false by default
    core.fsyncLooseRefs=[<bool>] # true by default?

And then the "bulk" config, which would be orthagonal to this.

I.e. do we really have a use-case for the rest of the kitchen sink?

>     1. The "porcelain" group: "committed", "added", "all", "none". These
>        are abstract groups whose behaviour should adapt as we change
>        implementations, and are those that should typically be set by a
>        user, if intended.
>
>     2. The "plumbing" or "expert" group: these are fine-grained options
>        which shouldn't typically be used by Git users. They still have
>        merit though in hosting environments, where requirements are
>        typically a lot more specific.
>
> We may also provide different guarantees for both groups. The first one
> should definitely be stable, but we might state that the second group is
> subject to change in the future.

I hope we can work something out :)

Overall: I think you've left one of the the main things I brought up[1]
unaddressed, i.e. that the core.fsync config schema in its current form
assumes that we can sync A or B, and configure those separately.

Which AFAIKT is because Neeraj's initial implementation & the discussion
was focused on finishing A or B with a per-"group" "cookie" to flush the
files.

But as [2] shows it's more performant for us to simply defer the fsync
of A until the committing of B.

Which is the main reason I think we should be re-visiting this. Sure, if
we were just syncing A, B or C having per-[ABC] config options might be
a bit overdoing it, but would be relatively simple.

But once we start using a more optimized version of the "bulk" mode the
config schema will be making promises about individual steps in a
transaction that I think we'll want to leave opaque, and only promise
that when git returns it will have synced all the relevant assets as
efficiently as possible.

1. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/
Neeraj Singh March 28, 2022, 7:56 p.m. UTC | #10
On Mon, Mar 28, 2022 at 5:15 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> I hope we can work something out :)
>
> Overall: I think you've left one of the the main things I brought up[1]
> unaddressed, i.e. that the core.fsync config schema in its current form
> assumes that we can sync A or B, and configure those separately.
>
> Which AFAIKT is because Neeraj's initial implementation & the discussion
> was focused on finishing A or B with a per-"group" "cookie" to flush the
> files.
>
> But as [2] shows it's more performant for us to simply defer the fsync
> of A until the committing of B.
>
> Which is the main reason I think we should be re-visiting this. Sure, if
> we were just syncing A, B or C having per-[ABC] config options might be
> a bit overdoing it, but would be relatively simple.
>
> But once we start using a more optimized version of the "bulk" mode the
> config schema will be making promises about individual steps in a
> transaction that I think we'll want to leave opaque, and only promise
> that when git returns it will have synced all the relevant assets as
> efficiently as possible.
>
> 1. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/RFC-patch-v2-4.7-61f4f3d7ef4-20220323T140753Z-avarab@gmail.com/

I think the current documentation is fine (obviously since I wrote
it).  Let's reproduce the first part again:
---
core.fsync::
A comma-separated list of components of the repository that
should be hardened via the core.fsyncMethod when created or
modified.  You can disable hardening of any component by
prefixing it with a '-'.  Items that are not hardened may be
lost in the event of an unclean system shutdown. Unless you
have special requirements, it is recommended that you leave
this option empty or pick one of `committed`, `added`,
or `all`.
+
When this configuration is encountered, the set of components starts with
the platform default value, disabled components are removed, and additional
components are added. `none` resets the state so that the platform default
is ignored.
+
The empty string resets the fsync configuration to the platform
default. The default on most platforms is equivalent to
`core.fsync=committed,-loose-object`, which has good performance,
but risks losing recent work in the event of an unclean system shutdown.
+
---

We're only talking about "hardening" parts of the repository, and we
say we'll do it using the "fsyncMethod".  If you don't harden
something, you could lose it if the system dies.  All of these
statements are true and don't say so much about the implementation of
how the hardening happens. It's perfectly valid to not force any
component out of the disk cache, and a straightforward implementation
of repo transactions can put the sync point anywhere. We also
explicitly point the user at a "porcelain" setting in the first
paragraph.

So I think you're alone in thinking that anything needs to change here.

Thanks,
Neeraj
Neeraj Singh March 30, 2022, 4:59 p.m. UTC | #11
On Mon, Mar 28, 2022 at 12:56 PM Neeraj Singh <nksingh85@gmail.com> wrote:
>
> On Mon, Mar 28, 2022 at 5:15 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>
> So I think you're alone in thinking that anything needs to change here.
>

Ævar,
I apologize for the negative tone and content of this comment.

Thanks,
Neeraj
diff mbox series

Patch

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 365a12dc7ae..536238e209b 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -548,49 +548,35 @@  core.whitespace::
   errors. The default tab width is 8. Allowed values are 1 to 63.
 
 core.fsync::
-	A comma-separated list of components of the repository that
-	should be hardened via the core.fsyncMethod when created or
-	modified.  You can disable hardening of any component by
-	prefixing it with a '-'.  Items that are not hardened may be
-	lost in the event of an unclean	system shutdown. Unless you
-	have special requirements, it is recommended that you leave
-	this option empty or pick one of `committed`, `added`,
-	or `all`.
-+
-When this configuration is encountered, the set of components starts with
-the platform default value, disabled components are removed, and additional
-components are added. `none` resets the state so that the platform default
-is ignored.
-+
-The empty string resets the fsync configuration to the platform
-default. The default on most platforms is equivalent to
-`core.fsync=committed,-loose-object`, which has good performance,
-but risks losing recent work in the event of an unclean system shutdown.
-+
-* `none` clears the set of fsynced components.
-* `loose-object` hardens objects added to the repo in loose-object form.
-* `pack` hardens objects added to the repo in packfile form.
-* `pack-metadata` hardens packfile bitmaps and indexes.
-* `commit-graph` hardens the commit graph file.
-* `index` hardens the index when it is modified.
-* `objects` is an aggregate option that is equivalent to
-  `loose-object,pack`.
-* `derived-metadata` is an aggregate option that is equivalent to
-  `pack-metadata,commit-graph`.
-* `committed` is an aggregate option that is currently equivalent to
-  `objects`. This mode sacrifices some performance to ensure that work
-  that is committed to the repository with `git commit` or similar commands
-  is hardened.
-* `added` is an aggregate option that is currently equivalent to
-  `committed,index`. This mode sacrifices additional performance to
-  ensure that the results of commands like `git add` and similar operations
-  are hardened.
-* `all` is an aggregate option that syncs all individual components above.
+	A boolen defaulting to `true`. To ensure data integrity git
+	will fsync() its objects, index and refu updates etc. This can
+	be set to `false` to disable `fsync()`-ing.
++
+Only set this to `false` if you know what you're doing, and are
+prepared to deal with data corruption. Valid use-cases include
+throwaway uses of repositories on ramdisks, one-off mass-imports
+followed by calling `sync(1)` etc.
++
+Note that the syncing of loose objects is currently excluded from
+`core.fsync=true`. To turn on all fsync-ing you'll need
+`core.fsync=true` and `core.fsyncObjectFiles=true`, but see
+`core.fsyncMethod=batch` below for a much faster alternative that's
+just as safe on various modern OS's.
++
+The default is in flux and may change in the future, in particular the
+equivalent of the already-deprecated `core.fsyncObjectFiles` setting
+might soon default to `true`, and `core.fsyncMethod`'s default of
+`fsync` might default to a setting deemed to be safe on the local OS,
+suc has `batch` or `writeout-only`
 
 core.fsyncMethod::
 	A value indicating the strategy Git will use to harden repository data
 	using fsync and related primitives.
 +
+Defaults to `fsync`, but as discussed for `core.fsync` above might
+change to use one of the values below taking advantage of
+platform-specific "faster `fsync()`".
++
 * `fsync` uses the fsync() system call or platform equivalents.
 * `writeout-only` issues pagecache writeback requests, but depending on the
   filesystem and storage hardware, data added to the repository may not be
@@ -680,8 +666,8 @@  backed up by any standard (e.g. POSIX), but worked in practice on some
 Linux setups.
 +
 Nowadays you should almost certainly want to use
-`core.fsync=loose-object` instead in combination with
-`core.fsyncMethod=bulk`, and possibly with
+`core.fsync=true` instead in combination with
+`core.fsyncMethod=batch`, and possibly with
 `fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
 OSX, Windows) that gives you most of the performance benefit of
 `core.fsyncObjectFiles=false` with all of the safety of the old