Message ID: pull.1076.v9.git.git.1637020263.gitgitgadget@gmail.com (mailing list archive)
Series: Implement a batched fsync option for core.fsyncObjectFiles
On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:

> * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
>   'true', 'false', and 'batch'. This makes using old and new versions of
>   git with 'batch' mode a little trickier, but hopefully people will
>   generally be moving forward in versions.
>
> [1] See https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
> [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/

I really think leaving that in-place is just being unnecessarily
cavalier. There are a lot of mixed-version environments that git is
deployed in, and we almost never break the configuration in this way (I
think in the past always by mistake).

In this case it's easy to avoid it, and coming up with a less narrow
config model[1] seems like a good idea in any case to unify the various
outstanding work in this area.

More generally on this series, per the thread ending in [2] I really
don't get why we have code like this:

    @@ -503,10 +504,12 @@ static void unpack_all(void)
     	if (!quiet)
     		progress = start_progress(_("Unpacking objects"), nr_objects);
     	CALLOC_ARRAY(obj_list, nr_objects);
    +	plug_bulk_checkin();
     	for (i = 0; i < nr_objects; i++) {
     		unpack_one(i);
     		display_progress(progress, i + 1);
     	}
    +	unplug_bulk_checkin();
     	stop_progress(&progress);

     	if (delta_list)

As opposed to doing an fsync on the last object we're processing. I.e.
why do we need the step of intentionally making the objects unavailable
in the tmp-objdir, and creating a "cookie" file to sync at the
start/end, as opposed to fsyncing on the last file (which we're writing
out anyway)?

1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/
On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> I really think leaving that in-place is just being unnecessarily
> cavalier. There's a lot of mixed-version environments where git is
> deployed in, and we almost never break the configuration in this way (I
> think in the past always by mistake).
>
> In this case it's easy to avoid it, and coming up with a less narrow
> config model[1] seems like a good idea in any case to unify the various
> outstanding work in this area.

My primary goal in all of these changes is to move git-for-windows over
to a default of batch fsync so that it can get closer to other platforms
in performance of 'git add' while still retaining the same level of data
integrity. I'm hoping that most end-users are just sticking to defaults
here.

I'm happy to change the configuration schema again if there's a
consensus from the Git community that backwards-compatibility of the
configuration is actually important to someone.

Also, if we're doing a deeper rethink of the fsync configuration (as
prompted by this work and Eric Wong's and Patrick Steinhardt's work), do
we want to retain a mode where we fsync some parts of the persistent
repo data but not others? If we add fsyncing of the index in addition to
the refs, I believe we would have covered all of the critical data
structures that would be needed to find the data that a user has added
to the repo if they complete a series of git commands and then
experience a system crash.

> don't get why we have code like this:
>
> [... diff adding plug_bulk_checkin()/unplug_bulk_checkin() around the
> unpack_all() loop, quoted upthread ...]
>
> As opposed to doing an fsync on the last object we're processing. I.e.
> why do we need the step of intentionally making the objects unavailable
> in the tmp-objdir, and creating a "cookie" file to sync at the
> start/end, as opposed to fsyncing on the last file (which we're writing
> out anyway)?

It's important to not expose an object's final name until its contents
have been fsynced to disk. We want to ensure that wherever we crash, we
won't have a loose object that Git may later try to open where the
filename doesn't match the content hash. I believe it's okay for a given
OID to be missing, since a later command could recreate it, but an
object with a wrong hash looks like it would persist until we do a
git-fsck.

I thought about figuring out how to sync the last object rather than
some random "cookie" file, but it wasn't clear to me how I'd figure out
which object is actually last from library code in a way that doesn't
burden each command with somehow figuring out its last object and
communicating that. The 'cookie' approach seems to lead to a cleaner
interface for callers.

Thanks,
Neeraj
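Neeraj's ordering constraint above (contents durable before the final name becomes visible) is the classic write/fsync/rename pattern. A minimal portable sketch of just that invariant — not Git's actual implementation, and the function name is made up for illustration:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch: fsync the contents while the file still has its temporary
 * name, and only then rename() it into place.  A crash can leave a
 * stray tmpfile behind, but never a final-named object whose contents
 * don't match its name.
 */
static int write_object_durably(const char *tmpname, const char *final,
				const void *buf, size_t len)
{
	int fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len ||
	    fsync(fd) < 0 ||		/* contents durable first... */
	    close(fd) < 0)
		return -1;
	return rename(tmpname, final);	/* ...then expose the final name */
}
```

The "batch" question in this thread is then whether the per-file fsync() step can be amortized across many such writes, given that the rename()s must still happen only after whatever flush makes the contents durable.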
On Tue, Nov 16 2021, Neeraj Singh wrote:

> [...]
>
> Also, if we're doing a deeper rethink of the fsync configuration (as
> prompted by this work and Eric Wong's and Patrick Steinhardt's work),
> do we want to retain a mode where we fsync some parts of the
> persistent repo data but not others? If we add fsyncing of the index
> in addition to the refs, I believe we would have covered all of the
> critical data structures that would be needed to find the data that a
> user has added to the repo if they complete a series of git commands
> and then experience a system crash.

Just talking about it is how we'll find consensus, maybe you & Junio
would like to keep it as-is. I don't see why we'd expose this bad edge
case in configuration handling to users when it's entirely avoidable,
and we're still in the design phase.

>> don't get why we have code like this:
>>
>> [...]
>>
>> As opposed to doing an fsync on the last object we're processing.
>> I.e. why do we need the step of intentionally making the objects
>> unavailable in the tmp-objdir, and creating a "cookie" file to sync
>> at the start/end, as opposed to fsyncing on the last file (which
>> we're writing out anyway)?
>
> It's important to not expose an object's final name until its contents
> have been fsynced to disk. We want to ensure that wherever we crash,
> we won't have a loose object that Git may later try to open where the
> filename doesn't match the content hash. I believe it's okay for a
> given OID to be missing, since a later command could recreate it, but
> an object with a wrong hash looks like it would persist until we do a
> git-fsck.

Yes, we handle that rather badly, as I mentioned in some other threads,
but not doing the fsync on the last object v.s. a "cookie" file right
afterwards seems like a hail-mary at best, no?

> I thought about figuring out how to sync the last object rather than
> some random "cookie" file, but it wasn't clear to me how I'd figure
> out which object is actually last from library code in a way that
> doesn't burden each command with somehow figuring out its last object
> and communicating that. The 'cookie' approach seems to lead to a
> cleaner interface for callers.

The above quoted code is looping through nr_objects, isn't it? Can't a
"do fsync" be passed down to unpack_one() when we process the last loose
object?
On Tue, Nov 16, 2021 at 11:28 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> [...]
>
> Just talking about it is how we'll find consensus, maybe you & Junio
> would like to keep it as-is. I don't see why we'd expose this bad edge
> case in configuration handling to users when it's entirely avoidable,
> and we're still in the design phase.

After trying to figure out an implementation, I have a new proposal,
which I've shared on the other thread [1].

[1] https://lore.kernel.org/git/CANQDOdcdhfGtPg0PxpXQA5gQ4x9VknKDKCCi4HEB0Z1xgnjKzg@mail.gmail.com/

> Yes, we handle that rather badly, as I mentioned in some other
> threads, but not doing the fsync on the last object v.s. a "cookie"
> file right afterwards seems like a hail-mary at best, no?

I'm not quite grasping what you're saying here. Are you saying that
using a dummy file instead of one of the actual objects is less likely
to produce the desired outcome on actual filesystem implementations?

> The above quoted code is looping through nr_objects, isn't it? Can't a
> "do fsync" be passed down to unpack_one() when we process the last
> loose object?

Are you proposing that we do something different for unpack_objects
versus update_index and git-add? I was hoping to keep all of the users
of the batch fsync functionality equivalent. For the git-add workflow
and update-index, we'd need to track the most recent file so that we can
go back and fsync it. I don't believe that syncing the last object
composes well with the existing implementation of those commands.

Thanks,
Neeraj
On Wed, Nov 17 2021, Neeraj Singh wrote:

[Very late reply, sorry]

> [...]
>
> After trying to figure out an implementation, I have a new proposal,
> which I've shared on the other thread [1].
>
> [1] https://lore.kernel.org/git/CANQDOdcdhfGtPg0PxpXQA5gQ4x9VknKDKCCi4HEB0Z1xgnjKzg@mail.gmail.com/

This LGTM, or something simpler as Junio points out with his "too
fine-grained?" comment as a follow-up. I'm honestly quite apathetic
about what we end up with exactly as long as:

 1. We get the people who are adding these config settings to talk & see
    if they make sense in combination.

 2. We avoid the trap of hard dying on older versions.

> I'm not quite grasping what you're saying here. Are you saying that
> using a dummy file instead of one of the actual objects is less likely
> to produce the desired outcome on actual filesystem implementations?

[...covered below...]

> Are you proposing that we do something different for unpack_objects
> versus update_index and git-add? I was hoping to keep all of the users
> of the batch fsync functionality equivalent. For the git-add workflow
> and update-index, we'd need to track the most recent file so that we
> can go back and fsync it. I don't believe that syncing the last object
> composes well with the existing implementation of those commands.

There's probably cases where we need the cookie. I just mean instead of
the API being (as seen above in the quoted part), pseudocode:

    # A
    bulk_checkin_start_make_cookie();
    n = 10
    for i in 1..n:
        write_nth(i, fsync: 0);
    bulk_checkin_end_commit_cookie();

To have it be:

    # B
    bulk_checkin_start(do_cookie: 0);
    n = 10
    for i in 1..n:
        write_nth(i, fsync: (i == n));
    bulk_checkin_end();

Or actually, presumably simpler as:

    # C
    all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
    end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);

I.e. maybe there are cases where you really do need "A", but we're
usually (always?) writing out N objects, and we usually know it at the
same point where you'd want the plug_bulk_checkin/unplug_bulk_checkin,
so just fsyncing the last object/file/ref/whatever means we don't need
the whole ceremony of the cookie file.

I don't mind it per-se, but "B" and "C" just seem a lot simpler,
particularly since as those examples show we'll presumably want to pass
down a "do fsync?" to these in general, and we even usually have a
display_progress() in there.

So doesn't just doing "B" or "C" eliminate the need for a cookie
entirely?

Another advantage of that is that you'll presumably want such tracking
anyway even for the case of "A". Because as soon as you have say a batch
operation of writing X objects and Y refs you'd want to track this
anyway. I.e. either only fsync() on the ref write (particularly if
there's just the one ref), or on the last ref, or for each ref and no
object syncs. So this (like "C", except for the "do_batch" in the
"end_fsync" case):

    # D
    do_batch = in_existing_bulk_checkin() ? 1 : 0;
    all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
    end_fsync = bulk_checkin_mode() ? do_batch : all_fsync;
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);

I mean, usually we'd want the "all refs", I'm just thinking of a case
like "git fast-import" or other known-to-the-user batch operation.

Or, as in the case of my 4bc1fd6e394 (pack-objects: rename .idx files
into place after .bitmap files, 2021-09-09) we'd want to know that we're
writing all of say *.bitmap, *.rev where we currently fsync() all of
them, write *.bitmap, *.rev and *.pack (not sure that one is safe)
without fsync(), and then only fsync (or that and in-place move) the
*.idx.
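For concreteness, variant "C" above compiles down to something like the following sketch. All the names here — bulk_checkin_mode(), write_nth(), and so on — are the hypothetical ones from the pseudocode, not Git's real API:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-ins for config/bulk-checkin state. */
static int bulk_checkin_mode(void) { return 1; }
static int fsync_turned_on_in_general(void) { return 1; }

/* Write the i-th file, paying for an fsync only when asked to. */
static int write_nth(int i, int do_fsync)
{
	char name[32];
	snprintf(name, sizeof(name), "obj_%02d", i);

	int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	if (write(fd, name, strlen(name)) < 0 ||
	    (do_fsync && fsync(fd) < 0)) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Variant "C": in bulk-checkin mode, only the last write fsyncs. */
static int write_all(int n)
{
	int all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
	int end_fsync = bulk_checkin_mode() ? 1 : all_fsync;

	for (int i = 1; i <= n; i++)
		if (write_nth(i, i == n ? end_fsync : all_fsync) < 0)
			return -1;
	return 0;
}
```

(As the rest of the thread discusses, whether an fsync() on only the last fd actually makes the earlier, unsynced writes durable is filesystem-dependent, which is the crux of the disagreement.)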
On Wed, Dec 01 2021, Ævar Arnfjörð Bjarmason wrote: > On Wed, Nov 17 2021, Neeraj Singh wrote: > > [Very late reply, sorry] > >> On Tue, Nov 16, 2021 at 11:28 PM Ævar Arnfjörð Bjarmason >> <avarab@gmail.com> wrote: >>> >>> >>> On Tue, Nov 16 2021, Neeraj Singh wrote: >>> >>> > On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason >>> > <avarab@gmail.com> wrote: >>> >> >>> >> >>> >> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote: >>> >> >>> >> > * Per [2], I'm leaving the fsyncObjectFiles configuration as is with >>> >> > 'true', 'false', and 'batch'. This makes using old and new versions of >>> >> > git with 'batch' mode a little trickier, but hopefully people will >>> >> > generally be moving forward in versions. >>> >> > >>> >> > [1] See >>> >> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/ >>> >> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/ >>> >> >>> >> I really think leaving that in-place is just being unnecessarily >>> >> cavalier. There's a lot of mixed-version environments where git is >>> >> deployed in, and we almost never break the configuration in this way (I >>> >> think in the past always by mistake). >>> > >>> >> In this case it's easy to avoid it, and coming up with a less narrow >>> >> config model[1] seems like a good idea in any case to unify the various >>> >> outstanding work in this area. >>> >> >>> >> More generally on this series, per the thread ending in [2] I really >>> > >>> > My primary goal in all of these changes is to move git-for-windows over to >>> > a default of batch fsync so that it can get closer to other platforms >>> > in performance >>> > of 'git add' while still retaining the same level of data integrity. >>> > I'm hoping that >>> > most end-users are just sticking to defaults here. 
>>> > >>> > I'm happy to change the configuration schema again if there's a >>> > consensus from the Git >>> > community that backwards-compatibility of the configuration is >>> > actually important to someone. >>> > >>> > Also, if we're doing a deeper rethink of the fsync configuration (as >>> > prompted by this work and >>> > Eric Wong's and Patrick Steinhardts work), do we want to retain a mode >>> > where we fsync some >>> > parts of the persistent repo data but not others? If we add fsyncing >>> > of the index in addition to the refs, >>> > I believe we would have covered all of the critical data structures >>> > that would be needed to find the >>> > data that a user has added to the repo if they complete a series of >>> > git commands and then experience >>> > a system crash. >>> >>> Just talking about it is how we'll find consensus, maybe you & Junio >>> would like to keep it as-is. I don't see why we'd expose this bad edge >>> case in configuration handling to users when it's entirely avoidable, >>> and we're still in the design phase. >> >> After trying to figure out an implementation, I have a new proposal, >> which I've shared on the other thread [1]. >> >> [1] https://lore.kernel.org/git/CANQDOdcdhfGtPg0PxpXQA5gQ4x9VknKDKCCi4HEB0Z1xgnjKzg@mail.gmail.com/ > > This LGTM, or something simpler as Junio points out with his "too > fine-grained?" comment as a follow-up. I'm honestly quite apathetic > about what we end up with exactly as long as: > > 1. We get the people who are adding these config settings to talk & see if they make > sense in combination. > > 2. We avoid the trap of hard dying on older versions. 
> >>> >>> >> don't get why we have code like this: >>> >> >>> >> @@ -503,10 +504,12 @@ static void unpack_all(void) >>> >> if (!quiet) >>> >> progress = start_progress(_("Unpacking objects"), nr_objects); >>> >> CALLOC_ARRAY(obj_list, nr_objects); >>> >> + plug_bulk_checkin(); >>> >> for (i = 0; i < nr_objects; i++) { >>> >> unpack_one(i); >>> >> display_progress(progress, i + 1); >>> >> } >>> >> + unplug_bulk_checkin(); >>> >> stop_progress(&progress); >>> >> >>> >> if (delta_list) >>> >> >>> >> As opposed to doing an fsync on the last object we're >>> >> processing. I.e. why do we need the step of intentionally making the >>> >> objects unavailable in the tmp-objdir, and creating a "cookie" file to >>> >> sync at the start/end, as opposed to fsyncing on the last file (which >>> >> we're writing out anyway). >>> >> >>> >> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/ >>> >> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/ >>> > >>> > It's important to not expose an object's final name until its contents >>> > have been fsynced >>> > to disk. We want to ensure that wherever we crash, we won't have a >>> > loose object that >>> > Git may later try to open where the filename doesn't match the content >>> > hash. I believe it's >>> > okay for a given OID to be missing, since a later command could >>> > recreate it, but an object >>> > with a wrong hash looks like it would persist until we do a git-fsck. >>> >>> Yes, we handle that rather badly, as I mentioned in some other threads, >>> but not doing the fsync on the last object v.s. a "cookie" file right >>> afterwards seems like a hail-mary at best, no? >>> >> >> I'm not quite grasping what you're saying here. Are you saying that >> using a dummy >> file instead of one of the actual objects is less likely to produce >> the desired outcome >> on actual filesystem implementations? > > [...covered below...] 
> >>> > I thought about figuring out how to sync the last object rather than some random >>> > "cookie" file, but it wasn't clear to me how I'd figure out which >>> > object is actually last >>> > from library code in a way that doesn't burden each command with >>> > somehow figuring >>> > out its last object and communicating that. The 'cookie' approach >>> > seems to lead to a cleaner >>> > interface for callers. >>> >>> The above quoted code is looping through nr_objects isn't it? Can't a >>> "do fsync" be passed down to unpack_one() when we process the last loose >>> object? >> >> Are you proposing that we do something different for unpack_objects >> versus update_index >> and git-add? I was hoping to keep all of the users of the batch fsync >> functionality equivalent. >> For the git-add workflow and update-index, we'd need to track the most >> recent file so that we >> can go back and fsync it. I don't believe that syncing the last >> object composes well with the existing >> implementation of those commands. > > There's probably cases where we need the cookie. I just mean instead of > the API being (as seen above in the quoted part), pseudocode: > > # A > bulk_checkin_start_make_cookie(): > n = 10 > for i in 1..n: > write_nth(i, fsync: 0); > bulk_checkin_end_commit_cookie(); > > To have it be: > > # B > bulk_checkin_start(do_cookie: 0); > n = 10 > for i in 1..n: > write_nth(i, fsync: (i == n)); > bulk_checkin_end(); > > Or actually, presumably simpler as: > > # C > all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general(); > end_fsync = bulk_checkin_mode() ? 1 : all_fsync; > n = 10; > for i in 1..n: > write_nth(i, fsync: (i == n) ? end_fsync : all_fsync); > > I.e. maybe there are cases where you really do need "A", but we're > usually (always?) 
writing out N objects, and we usually know it at the > same point where you'd want the plug_bulk_checkin/unplug_bulk_checkin, > so just fsyncing the last object/file/ref/whatever means we don't need > the whole ceremony of the cookie file. > > I don't mind it per-se, but "B" and "C" just seem a lot simpler, > particulary since as those examples show we'll presumably want to pass > down a "do fsync?" to these in general, and we even usually have a > display_progress() in there. > > So doesn't just doing "B" or "C" eliminate the need for a cookie > entirely? > > Another advantage of that is that you'll presumably want such tracking > anyway even for the case of "A". > > Because as soon as you have say a batch operation of writing X objects > and Y refs you'd want to track this anyway. I.e. either only fsync() on > the ref write (particularly if there's just the one ref), or on the last > ref, or for each ref and no object syncs. So this (like "C", except for > the "do_batch" in the "end_fsync" case): > > # D > do_batch = in_existing_bulk_checkin() ? 1 : 0; > all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general(); > end_fsync = bulk_checkin_mode() ? do_batch : all_fsync; > n = 10; > for i in 1..n: > write_nth(i, fsync: (i == n) ? end_fsync : all_fsync); > > I mean, usually we'd want the "all refs", I'm just thinking of a case > like "git fast-import" or other known-to-the-user batch operation. > > Or, as in the case of my 4bc1fd6e394 (pack-objects: rename .idx files > into place after .bitmap files, 2021-09-09) we'd want to know that we're > writing all of say *.bitmap, *.rev where we currently fsync() all of > them, write *.bitmap, *.rev and *.pack (not sure that one is safe) > without fsync(), and then only fsync (or that and in-place move) the > *.idx. Replying to an old-ish E-Mail of mine with some more thought that came to mind after[1] (another recently resurrected fsync() thread). 
I wonder if there's another twist on the plan outlined in [2] that would
be both portable & efficient, i.e. the "slow" POSIX way to write files
A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush
N times.

And as we've discussed, doing it just on Z will implicitly flush A..Y on
common OS's in the wild, which we're taking advantage of here.

But aside from the rename() dance in [2], what do those OS's do if you
write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably
equivalently, in reverse order: Y..A)?

I'd think they'd be smart enough to know that they already implicitly
flushed that data since Z was flushed, and make those fsync()'s a
rather cheap noop.

But I don't know, hence the question.

If that's true then perhaps it's a path towards having our cake and
eating it too in some cases?

I.e. an FS that would flush A..Y if we flush Z would do so quickly and
reliably, whereas a FS that doesn't have such an optimization might be
just as slow for all of A..Y, but at least it'll be safe.

1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/
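Concretely, the experiment described above could be sketched like this (a toy in Python rather than Git's C, with invented file names; whether the trailing re-fsyncs really are cheap noops is precisely the open question, and timing this loop on a given FS is one way to find out):

```python
import os
import tempfile

def write_then_refsync(dirpath, payloads):
    """Write files "A..Z" keeping fds open, fsync only the last one,
    then go back and fsync the earlier ones (the re-syncs that the
    question above hopes the OS turns into cheap noops)."""
    fds = []
    for i, data in enumerate(payloads):
        fd = os.open(os.path.join(dirpath, "f%d" % i),
                     os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, data)
        fds.append(fd)           # keep open; beware E[MN]FILE at scale
    os.fsync(fds[-1])            # sync "Z": may implicitly flush the rest
    for fd in fds[:-1]:
        os.fsync(fd)             # re-sync "A..Y": cheap if already flushed?
    for fd in fds:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    write_then_refsync(d, [b"a", b"b", b"c"])
    print(sorted(os.listdir(d)))  # → ['f0', 'f1', 'f2']
```

Wrapping the two fsync phases in separate timers would show whether a given filesystem treats the second phase as a noop.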
On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > Replying to an old-ish E-Mail of mine with some more thought that came > to mind after[1] (another recently resurrected fsync() thread). > > I wonder if there's another twist on the plan outlined in [2] that would > be both portable & efficient, i.e. the "slow" POSIX way to write files > A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush > N times. > > And as we've discussed, doing it just on Z will implicitly flush A..Y on > common OS's in the wild, which we're taking advantage of here. > > But aside from the rename() dance in[2], what do those OS's do if you > write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably > equivalently, in reverse order: Y..A). > > I'd think they'd be smart enough to know that they already implicitly > flushed that data since Z was flushend, and make those fsync()'s a > rather cheap noop. > > But I don't know, hence the question. > > If that's true then perhaps it's a path towards having our cake and > eating it too in some cases? > > I.e. an FS that would flush A..Y if we flush Z would do so quickly and > reliably, whereas a FS that doesn't have such an optimization might be > just as slow for all of A..Y, but at least it'll be safe. > > 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/ > 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/ The important angle here is that we need some way to indicate to the OS what A..Y is before we fsync on Z. I.e. the OS will cache any writes in memory until some sync-ish operation is done on *that specific file*. Syncing just 'Z' with no sync operations on A..Y doesn't indicate that A..Y would get written out. Apparently the bad old ext3 behavior was similar to what you're proposing where a sync on 'Z' would imply something about independent files. 
Here's an interesting paper I recently came across that proposes the
interface we'd really want, 'syncv':
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf

Thanks,
Neeraj
On Wed, Mar 09 2022, Neeraj Singh wrote: > On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: >> >> Replying to an old-ish E-Mail of mine with some more thought that came >> to mind after[1] (another recently resurrected fsync() thread). >> >> I wonder if there's another twist on the plan outlined in [2] that would >> be both portable & efficient, i.e. the "slow" POSIX way to write files >> A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush >> N times. >> >> And as we've discussed, doing it just on Z will implicitly flush A..Y on >> common OS's in the wild, which we're taking advantage of here. >> >> But aside from the rename() dance in[2], what do those OS's do if you >> write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably >> equivalently, in reverse order: Y..A). >> >> I'd think they'd be smart enough to know that they already implicitly >> flushed that data since Z was flushend, and make those fsync()'s a >> rather cheap noop. >> >> But I don't know, hence the question. >> >> If that's true then perhaps it's a path towards having our cake and >> eating it too in some cases? >> >> I.e. an FS that would flush A..Y if we flush Z would do so quickly and >> reliably, whereas a FS that doesn't have such an optimization might be >> just as slow for all of A..Y, but at least it'll be safe. >> >> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/ >> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/ > > The important angle here is that we need some way to indicate to the > OS what A..Y is before we fsync on Z. I.e. the OS will cache any > writes in memory until some sync-ish operation is done on *that > specific file*. Syncing just 'Z' with no sync operations on A..Y > doesn't indicate that A..Y would get written out. 
> Apparently the bad
> old ext3 behavior was similar to what you're proposing where a sync on
> 'Z' would imply something about independent files.

It's certainly starting to sound like I'm misunderstanding this whole
thing, but just to clarify again I'm talking about the sort of loops
mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail):

    bulk_checkin_start_make_cookie():
    n = 10
    for i in 1..n:
        write_nth(i, fsync: 0);
    bulk_checkin_end_commit_cookie();

I.e. we have a "cookie" file in a given dir (where, in this example,
we'd also write files A..Z). I.e. we write:

    cookie
    {A..Z}
    cookie

And then only fsync() on the "cookie" at the end, which "flushes" the
A..Z updates on some FS's (again, all per my possibly-incorrect
understanding).

Which is why I proposed that in many/all cases we could do this,
i.e. just the same without the "cookie" file (which AFAICT isn't needed
per-se, but was just added to make the API a bit simpler in not needing
to modify the relevant loops):

    all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
    end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);

I.e. we don't pay the cost of the fsync() as we're in the loop, but just
for the last file, which "flushes" the rest.

So far all of that's a paraphrasing of existing exchanges, but what I
was wondering now in [2] is if we add this to this last example above:

    for i in 1..n-1:
        fsync_nth(i)

Wouldn't those same OS's that are being clever about deferring the
syncing of A..Z as a "batch" be clever enough to turn that (re-)syncing
into a NOOP?

Of course in this case we'd need to keep the fd's open and be clever
about E[MN]FILE (i.e. "Too many open..."), or do an fsync() every Nth
for some reasonable Nth, e.g. somewhere in the 2^10..2^12 range.

But *if* this works it seems to me to be something we might be able to
enable when "core.fsyncObjectFiles" is configured on those systems.

I.e. the implicit assumption with that configuration was that if we sync
N loose objects and then update and fsync the ref that the FS would
queue up the ref update after the syncing of the loose objects.

This new "cookie" (or my suggested "fsync last of N") is basically
making the same assumption, just with the slight twist that some OSs/FSs
are known to behave like that on a per-subdir basis, no?

> Here's an interesting paper I recently came across that proposes the
> interface we'd really want, 'syncv':
> https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.

1. https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@evledraar.gmail.com/
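The cookie sequence being debated (write the objects with no per-file fsync, then fsync a single cookie at the end) can be sketched as a toy like this; the plug/unplug names echo Git's bulk-checkin API, but the layout and helpers are invented for illustration, and the single-fsync durability claim only holds on filesystems with the behavior discussed in this thread:

```python
import os
import tempfile

def write_object_no_fsync(dirpath, name, data):
    """Write one object while "plugged": data stays in the OS cache."""
    fd = os.open(os.path.join(dirpath, name),
                 os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # deliberately no fsync here
    finally:
        os.close(fd)

def unplug_bulk_checkin(dirpath):
    """Create and fsync a single "cookie" file; one flush is assumed to
    stand in for syncing every object written since the plug."""
    fd = os.open(os.path.join(dirpath, "cookie"),
                 os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        write_object_no_fsync(d, "obj%d" % i, b"payload")
    unplug_bulk_checkin(d)
    print(sorted(os.listdir(d)))  # → ['cookie', 'obj0', 'obj1', 'obj2']
```

The "fsync last of N" alternative would simply fsync the final object's fd instead of creating the cookie, which is the trade-off weighed above.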
On Thu, Mar 10, 2022 at 6:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Wed, Mar 09 2022, Neeraj Singh wrote: > > > On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > >> > >> Replying to an old-ish E-Mail of mine with some more thought that came > >> to mind after[1] (another recently resurrected fsync() thread). > >> > >> I wonder if there's another twist on the plan outlined in [2] that would > >> be both portable & efficient, i.e. the "slow" POSIX way to write files > >> A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush > >> N times. > >> > >> And as we've discussed, doing it just on Z will implicitly flush A..Y on > >> common OS's in the wild, which we're taking advantage of here. > >> > >> But aside from the rename() dance in[2], what do those OS's do if you > >> write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably > >> equivalently, in reverse order: Y..A). > >> > >> I'd think they'd be smart enough to know that they already implicitly > >> flushed that data since Z was flushend, and make those fsync()'s a > >> rather cheap noop. > >> > >> But I don't know, hence the question. > >> > >> If that's true then perhaps it's a path towards having our cake and > >> eating it too in some cases? > >> > >> I.e. an FS that would flush A..Y if we flush Z would do so quickly and > >> reliably, whereas a FS that doesn't have such an optimization might be > >> just as slow for all of A..Y, but at least it'll be safe. > >> > >> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/ > >> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/ > > > > The important angle here is that we need some way to indicate to the > > OS what A..Y is before we fsync on Z. I.e. the OS will cache any > > writes in memory until some sync-ish operation is done on *that > > specific file*. 
Syncing just 'Z' with no sync operations on A..Y > > doesn't indicate that A..Y would get written out. Apparently the bad > > old ext3 behavior was similar to what you're proposing where a sync on > > 'Z' would imply something about independent files. > > It's certainly starting to sound like I'm misunderstanding this whole > thing, but just to clarify again I'm talking about the sort of loops > mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail): > > bulk_checkin_start_make_cookie(): > n = 10 > for i in 1..n: > write_nth(i, fsync: 0); > bulk_checkin_end_commit_cookie(); > > I.e. we have a "cookie" file in a given dir (where, in this example, > we'd also write files A..Z). I.e. we write: > > cookie > {A..Z} > cookie > > And then only fsync() on the "cookie" at the end, which "flushes" the > A..Z updates on some FS's (again, all per my possibly-incorrect > understanding). > > Which is why I proposed that in many/all cases we could do this, > i.e. just the same without the "cookie" file (which AFAICT isn't needed > per-se, but was just added to make the API a bit simpler in not needing > to modify the relevant loops): > > all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general(); > end_fsync = bulk_checkin_mode() ? 1 : all_fsync; > n = 10; > for i in 1..n: > write_nth(i, fsync: (i == n) ? end_fsync : all_fsync); > > I.e. we don't pay the cost of the fsync() as we're in the loop, but just > for the last file, which "flushes" the rest. > > So far all of that's a paraphrasing of existing exchanges, but what I > was wondering now in[2] is if we add this to this last example above: > > for i in 1..n-1: > fsync_nth(i) > > Wouldn't those same OS's that are being clever about deferring the > syncing of A..Z as a "batch" be clever enough to turn that (re-)syncing > into a NOOP? > > Of course in this case we'd need to keep the fd's open and be clever > about E[MN]FILE (i.e. 
> "Too many open..."), or do an fsync() every Nth
> for some reasonable Nth, e.g. somewhere in the 2^10..2^12 range.
>
> But *if* this works it seems to me to be something we might be able to
> enable when "core.fsyncObjectFiles" is configured on those systems.
>
> I.e. the implicit assumption with that configuration was that if we sync
> N loose objects and then update and fsync the ref that the FS would
> queue up the ref update after the syncing of the loose objects.
>
> This new "cookie" (or my suggested "fsync last of N") is basically
> making the same assumption, just with the slight twist that some OSs/FSs
> are known to behave like that on a per-subdir basis, no?
>
> > Here's an interesting paper I recently came across that proposes the
> > interface we'd really want, 'syncv':
> > https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.
>
> 1. https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@evledraar.gmail.com/

On the actual FS implementations in the three common OSes I'm familiar
with (macOS, Windows, Linux), each file has its own independent data
caching in OS memory. Fsyncing one of them doesn't necessarily imply
writing out the OS cache for any other file. Except, apparently, on
ext3 in data=ordered mode, but that FS is no longer common. On Linux,
we use sync_file_range to get the OS to write the in-memory cache to
the storage hardware, which is what makes the data 'available' to
fsync.

Now, we could consider an implementation where we call sync_file_range
without the wait flags (i.e. without SYNC_FILE_RANGE_WAIT_BEFORE and
SYNC_FILE_RANGE_WAIT_AFTER). Then we could later fsync every file (or
batch of files), which might be more efficient if the OS coalesces the
disk cache flushes. I expect that this method is less likely to give us
the desired performance on common linux FSes, however.
The macOS and Windows APIs are defined a bit differently from Linux. In
both those OSes, we're actually calling fsync-equivalent APIs that are
defined to write back all the relevant data and metadata, just without
the storage cache flush.

So to summarize:

1. We need to do write(2) to get the data out of Git and into the OS
   filesystem cache.
2. We need some API (macOS-fsync, Windows-NtFlushBuffersFileEx,
   Linux-sync_file_range) to transfer the data per-file to the storage
   controller, but without flushing the storage controller.
3. We need some API (macOS-F_FULLFSYNC, Windows-NtFlushBuffersFile,
   Linux-fsync) to push the storage controller cache to durable media.
   This only needs to be done once at the end to push out the data made
   available in step (2).

Thanks,
Neeraj
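A toy sketch of those three steps on the Linux side (in Python via ctypes, since sync_file_range isn't exposed in the os module; the SYNC_FILE_RANGE_WRITE constant is from the sync_file_range(2) man page, the writeback step degrades to a no-op on other platforms, and the single trailing fsync stands in for the batch flush; this is not Git's actual code):

```python
import ctypes
import os
import sys
import tempfile

SYNC_FILE_RANGE_WRITE = 2  # start writeback; don't wait, don't flush cache

def push_to_storage(fd):
    """Step 2: hand the file's dirty pages to the device without a
    storage-cache flush. Linux-only; a no-op elsewhere (macOS/Windows
    use the different APIs named above)."""
    if sys.platform.startswith("linux"):
        libc = ctypes.CDLL(None, use_errno=True)
        libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_int64,
                                         ctypes.c_int64, ctypes.c_uint]
        libc.sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE)

def batched_write(dirpath, payloads):
    fds = []
    for i, data in enumerate(payloads):
        fd = os.open(os.path.join(dirpath, "f%d" % i),
                     os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, data)   # step 1: data into the OS filesystem cache
        push_to_storage(fd)  # step 2: per-file writeback, no cache flush
        fds.append(fd)
    os.fsync(fds[-1])        # step 3: one storage-cache flush at the end
    for fd in fds:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    batched_write(d, [b"a", b"b"])
    print(sorted(os.listdir(d)))  # → ['f0', 'f1']
```

The point of the split is that step 2 is cheap and per-file, while step 3 (the expensive hardware flush) is paid once for the whole batch.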
On March 10, 2022 12:53 PM, Neeraj Singh wrote: >On Thu, Mar 10, 2022 at 6:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> >wrote: >> >> >> On Wed, Mar 09 2022, Neeraj Singh wrote: >> >> > On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason ><avarab@gmail.com> wrote: >> >> >> >> Replying to an old-ish E-Mail of mine with some more thought that >> >> came to mind after[1] (another recently resurrected fsync() thread). >> >> >> >> I wonder if there's another twist on the plan outlined in [2] that >> >> would be both portable & efficient, i.e. the "slow" POSIX way to >> >> write files A..Z is to open/write/close/fsync each one, so we'll >> >> trigger a HW flush N times. >> >> >> >> And as we've discussed, doing it just on Z will implicitly flush >> >> A..Y on common OS's in the wild, which we're taking advantage of here. >> >> >> >> But aside from the rename() dance in[2], what do those OS's do if >> >> you write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, >> >> presumably equivalently, in reverse order: Y..A). >> >> >> >> I'd think they'd be smart enough to know that they already >> >> implicitly flushed that data since Z was flushend, and make those >> >> fsync()'s a rather cheap noop. >> >> >> >> But I don't know, hence the question. >> >> >> >> If that's true then perhaps it's a path towards having our cake and >> >> eating it too in some cases? >> >> >> >> I.e. an FS that would flush A..Y if we flush Z would do so quickly >> >> and reliably, whereas a FS that doesn't have such an optimization >> >> might be just as slow for all of A..Y, but at least it'll be safe. >> >> >> >> 1. >> >> https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail >> >> .com/ 2. >> >> https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f >> >> 3.1637020263.git.gitgitgadget@gmail.com/ >> > >> > The important angle here is that we need some way to indicate to the >> > OS what A..Y is before we fsync on Z. I.e. 
the OS will cache any >> > writes in memory until some sync-ish operation is done on *that >> > specific file*. Syncing just 'Z' with no sync operations on A..Y >> > doesn't indicate that A..Y would get written out. Apparently the >> > bad old ext3 behavior was similar to what you're proposing where a >> > sync on 'Z' would imply something about independent files. >> >> It's certainly starting to sound like I'm misunderstanding this whole >> thing, but just to clarify again I'm talking about the sort of loops >> mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail): >> >> bulk_checkin_start_make_cookie(): >> n = 10 >> for i in 1..n: >> write_nth(i, fsync: 0); >> bulk_checkin_end_commit_cookie(); >> >> I.e. we have a "cookie" file in a given dir (where, in this example, >> we'd also write files A..Z). I.e. we write: >> >> cookie >> {A..Z} >> cookie >> >> And then only fsync() on the "cookie" at the end, which "flushes" the >> A..Z updates on some FS's (again, all per my possibly-incorrect >> understanding). >> >> Which is why I proposed that in many/all cases we could do this, i.e. >> just the same without the "cookie" file (which AFAICT isn't needed >> per-se, but was just added to make the API a bit simpler in not >> needing to modify the relevant loops): >> >> all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general(); >> end_fsync = bulk_checkin_mode() ? 1 : all_fsync; >> n = 10; >> for i in 1..n: >> write_nth(i, fsync: (i == n) ? end_fsync : all_fsync); >> >> I.e. we don't pay the cost of the fsync() as we're in the loop, but >> just for the last file, which "flushes" the rest. >> >> So far all of that's a paraphrasing of existing exchanges, but what I >> was wondering now in[2] is if we add this to this last example above: >> >> for i in 1..n-1: >> fsync_nth(i) >> >> Wouldn't those same OS's that are being clever about deferring the >> syncing of A..Z as a "batch" be clever enough to turn that >> (re-)syncing into a NOOP? 
>> >> Of course in this case we'd need to keep the fd's open and be clever >> about E[MN]FILE (i.e. "Too many open..."), or do an fsync() every Nth >> for some reasonable Nth, e.g. somewhere in the 2^10..2^12 range. >> >> But *if* this works it seems to me to be something we might be able to >> enable when "core.fsyncObjectFiles" is configured on those systems. >> >> I.e. the implicit assumption with that configuration was that if we >> sync N loose objects and then update and fsync the ref that the FS >> would queue up the ref update after the syncing of the loose objects. >> >> This new "cookie" (or my suggested "fsync last of N") is basically >> making the same assumption, just with the slight twist that some >> OSs/FSs are known to behave like that on a per-subdir basis, no? >> >> > Here's an interesting paper I recently came across that proposes the >> > interface we'd really want, 'syncv': >> > >https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1 >&type=pdf. >> >> 1. >> https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@evledraar.gmail.co >> m/ 2. >> https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@evledraar.gmail.co >> m/ > >On the actual FS implementations in the three common OSes I'm familiar with >(macOS, Windows, Linux), each file has its own independent data caching in OS >memory. Fsyncing one of them doesn't necessarily imply writing out the OS cache >for any other file. Except, apparently, on ext3 in data=ordered mode, but that FS >is no longer common. On Linux, we use sync_file_range to get the OS to write the >in-memory cache to the storage hardware, which is what makes the data >'available' to fsync. > >Now, we could consider an implementation where we call sync_file_range >without the wait flags (i.e. without SYNC_FILE_RANGE_WAIT_BEFORE and >SYNC_FILE_RANGE_WAIT_AFTER). Then we could later fsync every file (or batch >of files), which might be more efficient if the OS coalesces the disk cache flushes. 
I >expect that this method is less likely to give us the desired performance on >common linux FSes, however. > >The macOS and Windows APIs are defined a bit differently from Linux. >In both those OSes, >we're actually calling fsync-equivalent APIs that are defined to write back all the >relevant data and metadata, just without the storage cache flush. > >So to summarize: >1. We need to do write(2) to get the data out of Git and into the OS filesystem >cache. >2. We need some API (macOS-fsync, Windows-NtFlushBuffersFileEx, >Linux-sync_file_range) > to transfer the data per-file to the storage controller, but without flushing the >storage controller. >3. We need some api (macOS-F_FULLFSYNC, Windows-NtFlushBuffersFile, Linux- >fsync) > to push the storage controller cache to durable media. This only needs to be >done once > at the end to push out the data made available in step (2). While this might not be a surprise, on some platforms fsync is a thread-blocking operation. When the OS has kernel threads, fsync can potentially cause multiple processes (if implemented that way) to block, particularly where an fd is shared across threads (and thus processes), which may end up causing a deadlock. We might need to keep an eye out for this type of situation in the future and at least try to test for it. I cannot actually see a situation where this would occur in git, but that does not mean it is impossible. Food for thought. --Randall
On Thu, Mar 10, 2022 at 10:08 AM <rsbecker@nexbridge.com> wrote:
> While this might not be a surprise, on some platforms fsync is a
> thread-blocking operation. When the OS has kernel threads, fsync can
> potentially cause multiple processes (if implemented that way) to
> block, particularly where an fd is shared across threads (and thus
> processes), which may end up causing a deadlock. We might need to keep
> an eye out for this type of situation in the future and at least try
> to test for it. I cannot actually see a situation where this would
> occur in git, but that does not mean it is impossible. Food for
> thought.
> --Randall

fsync is expected to block the calling thread until the underlying data
is durable. Unless the OS somehow depends on the git process to make
progress before fsync can complete, there should be no deadlock, since
there would be no cycle in the waiting graph. This could be a problem
for FUSE implementations that are backed by Git, but they already have
to deal with that possibility today and this patch series doesn't
change anything.
On March 10, 2022 1:43 PM, Neeraj Singh wrote:
>On Thu, Mar 10, 2022 at 10:08 AM <rsbecker@nexbridge.com> wrote:
>> While this might not be a surprise, on some platforms fsync is a
>> thread-blocking operation. When the OS has kernel threads, fsync can
>> potentially cause multiple processes (if implemented that way) to
>> block, particularly where an fd is shared across threads (and thus
>> processes), which may end up causing a deadlock. We might need to
>> keep an eye out for this type of situation in the future and at
>> least try to test for it. I cannot actually see a situation where
>> this would occur in git, but that does not mean it is impossible.
>> Food for thought.
>> --Randall
>
>fsync is expected to block the calling thread until the underlying
>data is durable. Unless the OS somehow depends on the git process to
>make progress before fsync can complete, there should be no deadlock,
>since there would be no cycle in the waiting graph. This could be a
>problem for FUSE implementations that are backed by Git, but they
>already have to deal with that possiblity today and this patch series
>doesn't change anything.

That assumption is based on a specific threading model. In cooperative
user-thread models, fsync is process-blocking. While fsync, by spec, is
required to block the thread, there are no limitations on blocking
everything else. In some systems, an fsync can block the entire file
system. Just pointing that out.