
[v3,3/4] gitfaq: add entry about syncing working trees

Message ID 20240704003818.750223-4-sandals@crustytoothpaste.net (mailing list archive)
State New
Series Additional FAQ entries

Commit Message

brian m. carlson July 4, 2024, 12:38 a.m. UTC
Users very commonly want to sync a working tree, including uncommitted
changes, across machines, often to carry in-progress work or stashes
from one system to another.  This is not a recommended approach, but
users want to do it and are not dissuaded by suggestions not to, so
let's recommend a sensible technique.

The technique many users turn to is their preferred cloud syncing
service, which is a bad idea.  Users have reported problems where they
end up with duplicate files that won't go away (with names like "file.c
2"), broken references, oddly named references with date stamps
appended to them, missing objects, and general corruption and data loss.
That's because almost all of these tools sync file by file, which is a
great technique if your project is a single word processing document or
spreadsheet, but is utterly abysmal for Git repositories, since the
tools don't necessarily snapshot the entire repository consistently.
They also tend to sync files immediately instead of waiting for the
repository to be quiescent, so an operation that writes multiple files,
such as a commit or a gc, can confuse the tools and lead to corruption.

We know that the old standby, rsync, is up to the task, provided that
the repository is quiescent, so let's suggest that and dissuade people
from using cloud syncing tools.  Let's tell people about the common
pitfalls they should be aware of before doing this, and that it is
still potentially risky.  Additionally, let's tell people that Git's
security model does not permit sharing working trees across users, in
case they planned to do that.  While we'd still prefer users didn't try
to do this, hopefully this will lead them in a safer direction.
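
Concretely, the suggestion amounts to something like the following
sketch (the hostname and paths here are hypothetical), run only while
no Git operations are in flight on either side:

    $ rsync -a --delete-after ~/project/ user@other-host:project/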

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitfaq.txt | 48 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

Comments

Junio C Hamano July 4, 2024, 5:21 a.m. UTC | #1
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> -Credentials
> ------------
> +Credentials and Transfers
> +-------------------------

I can see (and appreciate) that you struggled to find a good section
to piggyback on, instead of giving this topic its own section.  But
do these two make a good mix?  They seem to be totally different
topics.

> +It is important not to use a cloud syncing service to sync any portion of a Git
> +repository, since this can cause corruption, such as missing objects, changed
> +or added files, broken refs, and a wide variety of other corruption.  These
> +services tend to sync file by file on a continuous basis and don't understand
> +the structure of a Git repository.  This is especially bad if they sync the
> +repository in the middle of it being updated, since that is very likely to
> +cause incomplete or partial updates and therefore data loss.

A naïve reader may say "but isn't it the point of these cloud
syncing services that they will eventually catch up???" and we may
want to have a good story about why that does not work.

    You create many objects in one repository in loose form, the
    cloud syncing service kicks in to transfer them to the second
    repository, and then in the original repository an auto-gc kicks
    in, so some of the loose objects fail to propagate.  The packfile
    that results from the auto-gc will eventually propagate to the
    second repository, but before it completes, the second
    repository would be in an inconsistent state, and especially if
    the ref updates are propagated before the objects, the second
    repository will be in a corrupt state.  It would be a disaster
    if another auto-gc kicked in there.

is one scenario I came up with.
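
The repack transition at the heart of that scenario is easy to observe
directly.  A minimal sketch (assuming a repository that has just
written some loose objects; exact numbers will vary):

    $ git count-objects -v   # "count" is the number of loose objects
    $ git gc                 # packs them; the loose copies are removed
    $ git count-objects -v   # "count" drops while "in-pack" grows

A file-by-file syncer that captured the list of loose object files
before the `git gc` ran will try to propagate files that no longer
exist, possibly before the new packfile has finished transferring.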
brian m. carlson July 4, 2024, 9:08 p.m. UTC | #2
On 2024-07-04 at 05:21:55, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > -Credentials
> > ------------
> > +Credentials and Transfers
> > +-------------------------
> 
> I can see (and appreciate) that you struggled to find a good section
> to piggyback on, instead of giving this topic its own section.  But
> do these two make a good mix?  They seem to be totally different
> topics.

I can try again.

> > +It is important not to use a cloud syncing service to sync any portion of a Git
> > +repository, since this can cause corruption, such as missing objects, changed
> > +or added files, broken refs, and a wide variety of other corruption.  These
> > +services tend to sync file by file on a continuous basis and don't understand
> > +the structure of a Git repository.  This is especially bad if they sync the
> > +repository in the middle of it being updated, since that is very likely to
> > +cause incomplete or partial updates and therefore data loss.
> 
> A naïve reader may say "but isn't it the point of these cloud
> syncing services that they will eventually catch up???" and we may
> want to have a good story about why that does not work.
> 
>     You create many objects in one repository in loose form, the
>     cloud syncing service kicks in to transfer them to the second
>     repository, and then in the original repository an auto-gc kicks
>     in, so some of the loose objects fail to propagate.  The packfile
>     that results from the auto-gc will eventually propagate to the
>     second repository, but before it completes, the second
>     repository would be in an inconsistent state, and especially if
>     the ref updates are propagated before the objects, the second
>     repository will be in a corrupt state.  It would be a disaster
>     if another auto-gc kicked in there.
> 
> is one scenario I came up with.

The most common situation we see is that refs tend to be renamed to
things like "refs/heads/main 2", which is obviously not a valid refname
and doesn't work, or the ref gets rolled back to an older version.
Working trees also get stuck in weird states where files keep coming
back or getting deleted, or the index ends up as two differently named
copies, neither of which is "index".
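
As a quick illustration (a hypothetical session; the exact message
varies by Git version), Git itself rejects such a name:

    $ git check-ref-format --branch 'main 2'
    fatal: 'main 2' is not a valid branch name

so a repository where such a ref appears on disk is already broken as
far as Git is concerned.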

It is _less_ likely that objects are renamed, but the tool can conclude
that they've been legitimately deleted: if the loose objects get packed
on one machine, the tool sees the loose files disappear and deletes
them on the other machine, even though no other source of those objects
may exist there yet.  I'm not sure exactly how object loss happens in
the real world with these services, but users have reported it on
StackOverflow, so I'm confident it does occur.

If we have users who ask about this, I'm happy to answer them on the
list.  I don't want to explain the various and sundry scenarios in the
FAQ entry in order to keep it short, but I can find several examples of
problems if need be.
Junio C Hamano July 6, 2024, 5:50 a.m. UTC | #3
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> The most common situation we see is that refs tend to be renamed to
> things like "refs/heads/main 2", which is obviously not a valid refname
> and doesn't work, or the ref gets rolled back to an older version.
> Working trees also get stuck into weird states where files keep coming
> back or getting deleted, or the index gets two differently named copies,
> neither of which is "index".
>
> It is _less_ likely that objects are renamed, but it could be that the
> tool thinks they've been legitimately deleted if the loose objects get
> packed and then they do get deleted elsewhere without another source of
> those objects existing.

Yeah, any time two repositories that are "cloud synced" are
accessed simultaneously, all h*ll can easily break loose.  You may
move your 'master' branch to one commit while the other side moves
their 'master' branch to a different commit.  You may end up with a
"master" that points at one of these commits, but one of you may have
already lost the only reference to the commit you wanted to have at
the tip of your 'master' branch.  One of you may even trigger an
auto-gc that spreads the damage.

> If we have users who ask about this, I'm happy to answer them on the
> list.  I don't want to explain the various and sundry scenarios in the
> FAQ entry in order to keep it short, but I can find several examples of
> problems if need be.

OK, that approach would work as long as you are still involved in
the project, but having even one concrete example in the FAQ would
help in the longer term: it would (1) reduce the bus factor and
(2) save you the time of responding to every such question.

Thanks.

Patch

diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
index cdc5f5f4f8..0e7f1c680d 100644
--- a/Documentation/gitfaq.txt
+++ b/Documentation/gitfaq.txt
@@ -83,8 +83,8 @@  Windows would be the configuration `"C:\Program Files\Vim\gvim.exe" --nofork`,
 which quotes the filename with spaces and specifies the `--nofork` option to
 avoid backgrounding the process.
 
-Credentials
------------
+Credentials and Transfers
+-------------------------
 
 [[http-credentials]]
 How do I specify my credentials when pushing over HTTP?::
@@ -185,6 +185,50 @@  Then, you can adjust your push URL to use `git@example_author` or
 `git@example_committer` instead of `git@example.org` (e.g., `git remote set-url
 git@example_author:org1/project1.git`).
 
+[[sync-working-tree]]
+How do I sync a working tree across systems?::
+	First, decide whether you want to do this at all.  Git works best when you
+	push or pull your work using the typical `git push` and `git fetch` commands
+	and isn't designed to share a working tree across systems.  This is
+	potentially risky and in some cases can cause repository corruption or data
+	loss.
++
+Usually, doing so will cause `git status` to need to re-read every file in the
+working tree.  Additionally, Git's security model does not permit sharing a
+working tree across untrusted users, so it is only safe to sync a working tree
+if it will only be used by a single user across all machines.
++
+It is important not to use a cloud syncing service to sync any portion of a Git
+repository, since this can cause corruption, such as missing objects, changed
+or added files, broken refs, and a wide variety of other corruption.  These
+services tend to sync file by file on a continuous basis and don't understand
+the structure of a Git repository.  This is especially bad if they sync the
+repository in the middle of it being updated, since that is very likely to
+cause incomplete or partial updates and therefore data loss.
++
+Therefore, it's better to push your work to either the other system or a central
+server using the normal push and pull mechanism.  However, this doesn't always
+preserve important data, like stashes, so some people prefer to share a working
+tree across systems.
++
+If you do this, the recommended approach is to use `rsync -a --delete-after`
+(ideally over an encrypted connection such as `ssh`) on the root of the
+repository.  You should ensure several things when you do this:
++
+* If you have additional worktrees or a separate Git directory, they must be
+  synced at the same time as the main working tree and repository.
+* You are comfortable with the destination directory being an exact copy of the
+  source directory, _deleting any data that is already there_.
+* The repository (including all worktrees and the Git directory) is in a
+  quiescent state for the duration of the transfer (that is, no operations of
+  any sort are taking place on it, including background operations like `git
+  gc` and operations invoked by your editor).
++
+Be aware that even with these recommendations, syncing in this way has some risk
+since it bypasses Git's normal integrity checking for repositories, so having
+backups is advised.  You may also wish to do a `git fsck` to verify the
+integrity of your data on the destination system after syncing.
+
 Common Issues
 -------------
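
Putting the new entry's advice together, the transfer it describes
might look like the following sketch.  The hostname and paths are
hypothetical, and it assumes both repositories are completely quiescent
(no commits, gc, or editor integrations running on either side):

    # On the source machine.  The trailing slashes make the destination
    # an exact copy of the source, deleting anything extra there.
    $ rsync -a --delete-after ~/project/ user@other-host:project/

    # Any linked worktrees or a separate Git directory must be synced
    # in the same quiescent window, e.g.:
    $ rsync -a --delete-after ~/project-wt/ user@other-host:project-wt/

    # On the destination machine, check for stale lock files left by an
    # interrupted operation, then verify integrity as the entry suggests:
    $ cd ~/project
    $ find .git -name '*.lock'
    $ git fsck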