[4/4] gitfaq: add entry about syncing working trees

Message ID

20211020010624.675562-6-sandals@crustytoothpaste.net (mailing list archive)

State

New, archived

Headers

From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: <git@vger.kernel.org>
Cc: Jeff King <peff@peff.net>,
        Johannes Schindelin <Johannes.Schindelin@gmx.de>,
        Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH 4/4] gitfaq: add entry about syncing working trees
Date: Wed, 20 Oct 2021 01:06:24 +0000
Message-Id: <20211020010624.675562-6-sandals@crustytoothpaste.net>
In-Reply-To: <20211020010624.675562-1-sandals@crustytoothpaste.net>
References: <20211020010624.675562-1-sandals@crustytoothpaste.net>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Commit Message

brian m. carlson Oct. 20, 2021, 1:06 a.m. UTC

Users very commonly want to sync their working tree across machines,
often to carry across in-progress work or stashes.  Despite this not
being a recommended approach, users want to do it and are not dissuaded
by suggestions not to, so let's recommend a sensible technique.

The technique that many users are using is their preferred cloud syncing
service, which is a bad idea.  Users have reported problems where they
end up with duplicate files that won't go away (with names like "file.c
2"), broken references, oddly named references that have date stamps
appended to them, missing objects, and general corruption and data loss.
That's because almost all of these tools sync file by file, which is a
great technique if your project is a single word processing document or
spreadsheet, but is utterly abysmal for Git repositories because they
don't necessarily snapshot the entire repository correctly.  They also
tend to sync the files immediately instead of when the repository is
quiescent, so writing multiple files, as occurs during a commit or a gc,
can confuse the tools and lead to corruption.

We know that the old standby, rsync, is up to the task, provided that
the repository is quiescent, so let's suggest that and dissuade people
from using cloud syncing tools.  Let's tell people about common things
they should be aware of before doing this and that this is still
potentially risky.  Additionally, let's tell people that Git's security
model does not permit sharing working trees across users in case they
planned to do that.  While we'd still prefer users didn't try to do
this, hopefully this will lead them in a safer direction.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitfaq.txt | 43 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

Comments

Eric Sunshine Oct. 20, 2021, 1:38 a.m. UTC | #1

On Tue, Oct 19, 2021 at 9:06 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> gitfaq: add entry about syncing working trees

You sent two [4/4] patches. I'm guessing the one prefixed by "gitfaq:"
is the correct one.

> Users very commonly want to sync their working tree across machines,
> often to carry across in-progress work or stashes.  Despite this not
> being a recommended approach, users want to do it and are not dissuaded
> by suggestions not to, so let's recommend a sensible technique.
> [...]
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
> @@ -185,6 +185,45 @@ Then, you can adjust your push URL to use `git@example_author` or
> +[[sync-working-tree]]
> +How do I sync a working tree across systems?::
> +       First, decide whether you want to do this at all.  Git usually works better
> +       when you push or pull your work using the typical `git push` and `git fetch`
> +       commands and isn't designed to share a working tree across systems.  Doing so
> +       can cause `git status` to need to re-read every file in the working tree.
> +       Additionally, Git's security model does not permit sharing a working tree
> +       across untrusted users, so it is only safe to sync a working tree if it will
> +       only be used by a single user across all machines.
> ++
> +Therefore, it's better to push your work to either the other system or a central
> +server using the normal push and pull mechanism.  However, this doesn't always
> +preserve important data, like stashes, so some people prefer to share a working
> +tree across systems.
> ++
> +It is important not to use a cloud syncing service to sync any portion of a Git
> +repository, since this can cause corruption, such as missing objects, changed
> +or added files, broken refs, and a wide variety of other corruption.  These
> +services tend to sync file by file and don't understand the structure of a Git
> +repository.  This is especially bad if they sync the repository in the middle of
> +it being updated, since that is very likely to cause incomplete or partial
> +updates and therefore data loss.
> ++
> +The recommended approach is to use `rsync -a --delete-after` (ideally with an
> +encrypted connection such as with `ssh`) on the root of the repository.  You should
> +ensure several things when you do this:
> [...]
> +Be aware that even with these recommendations, syncing in this way is
> +potentially risky since it bypasses Git's normal integrity checking for
> +repositories, so having backups is advised.

Considering the potential damage which can result from this sort of
synching, this entire section seems too gentle. My knee-jerk reaction
is that it would be better to strongly dissuade upfront rather than
saying that it's okay to do this if you really want to. As such, I'm
wondering if organizing the section like this would be better:

(1) Make a strong statement against doing this: "<strong>Don't do it.</strong>"

(2) Explain why users shouldn't do it; in particular, the final
paragraph above which talks about integrity checks and whatnot should
be right up near the top along with discussion of corruption.

(3) Say that cloud-synching services must _not_ be used and explain why.

(4) Relent a tiny bit and explain that the only slightly acceptable
mechanism is `rsync` when used in a very strict fashion (quiescent
repository, etc.)

Johannes Schindelin Oct. 20, 2021, 12:09 p.m. UTC | #2

Hi brian,

On Wed, 20 Oct 2021, brian m. carlson wrote:

> Users very commonly want to sync their working tree across machines,

I was confused at first, because I do sync my working trees frequently. I
do this via `git push` and `git pull`, though.

Maybe clarify that you mean rsync, or sync services like DropBox and
OneDrive? I see that you mention "cloud syncing service" below, but I
believe that it might be better to lead with the examples.

> often to carry across in-progress work or stashes.  Despite this not
> being a recommended approach, users want to do it and are not dissuaded
> by suggestions not to, so let's recommend a sensible technique.
>
> The technique that many users are using is their preferred cloud syncing
> service, which is a bad idea.  Users have reported problems where they
> end up with duplicate files that won't go away (with names like "file.c
> 2"), broken references, oddly named references that have date stamps
> appended to them, missing objects, and general corruption and data loss.
> That's because almost all of these tools sync file by file, which is a
> great technique if your project is a single word processing document or
> spreadsheet, but is utterly abysmal for Git repositories because they
> don't necessarily snapshot the entire repository correctly.  They also
> tend to sync the files immediately instead of when the repository is
> quiescent, so writing multiple files, as occurs during a commit or a gc,
> can confuse the tools and lead to corruption.
>
> We know that the old standby, rsync, is up to the task, provided that
> the repository is quiescent, so let's suggest that and dissuade people
> from using cloud syncing tools.  Let's tell people about common things
> they should be aware of before doing this and that this is still
> potentially risky.  Additionally, let's tell people that Git's security
> model does not permit sharing working trees across users in case they
> planned to do that.  While we'd still prefer users didn't try to do
> this, hopefully this will lead them in a safer direction.

The remainder of the commit message is very clear.

Thank you,
Dscho
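
As a rough illustration of the push-and-pull approach described above, the
following sketch carries in-progress work between machines through a scratch
branch.  The function names, the `wip` branch, and the `origin` remote are
assumptions made for illustration and are not part of the patch.  Stashes are
not transferred this way, which is the gap the patch discusses.

```shell
#!/bin/sh
# Sketch only: "wip" and "origin" are hypothetical names.

# On the machine you are leaving: commit all changes in the repository at "$1"
# to a scratch branch and push it.
push_wip() {
	# -B creates the branch, or resets it to the current HEAD if it exists;
	# uncommitted changes in the working tree are carried along.
	git -C "$1" checkout -q -B wip &&
	git -C "$1" add -A &&
	git -C "$1" commit -q -m "WIP: in-progress work" &&
	# --force because the scratch branch is rewritten on every round trip.
	git -C "$1" push -q --force origin wip
}

# On the machine you are moving to: fetch the scratch branch into the
# repository at "$1" and check it out.
fetch_wip() {
	git -C "$1" fetch -q origin wip &&
	git -C "$1" checkout -q -B wip FETCH_HEAD
}
```

Because every step goes through Git's normal transfer machinery, the objects
are checked on receipt, unlike a file-level copy.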

brian m. carlson Oct. 20, 2021, 9:36 p.m. UTC | #3

On 2021-10-20 at 01:38:47, Eric Sunshine wrote:
> On Tue, Oct 19, 2021 at 9:06 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > gitfaq: add entry about syncing working trees
> 
> You sent two [4/4] patches. I'm guessing the one prefixed by "gitfaq:"
> is the correct one.

Yes, I appear to have done a bad cherry-pick.  Will fix in v2.

> Considering the potential damage which can result from this sort of
> synching, this entire section seems too gentle. My knee-jerk reaction
> is that it would be better to strongly dissuade upfront rather than
> saying that it's okay to do this if you really want to. As such, I'm
> wondering if organizing the section like this would be better:
> 
> (1) Make a strong statement against doing this: "<strong>Don't do it.</strong>"

I agree this is dangerous.  The pain involved is why I long ago stopped
having a desktop: I needed to sync in-progress work frequently, and
having multiple machines is too much of a hassle for that case.  The
laptop is more portable and can be used everywhere, even if it is less
powerful.

However, some people do legitimately need to work on the same project
across machines, and the current tooling for syncing stashes and other
in-progress work is insufficient.  Therefore, if we just tell people,
"Don't do this," they're going to stop reading and ignore us, because
we've neglected their needs and they have a job to do.  That would be
worse, because instead of using something like rsync, they might use a
cloud syncing service, and then they'll really be in a bad place.

However, I'm happy to try the reorganization you proposed, even if I
don't necessarily adopt the strength of the proposal, and see how it works.

diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
index 85ac99c7b2..4a8a46f980 100644
--- a/Documentation/gitfaq.txt
+++ b/Documentation/gitfaq.txt
@@ -83,8 +83,8 @@  Windows would be the configuration `"C:\Program Files\Vim\gvim.exe" --nofork`,
 which quotes the filename with spaces and specifies the `--nofork` option to
 avoid backgrounding the process.
 
-Credentials
------------
+Credentials and Transfers
+-------------------------
 
 [[http-credentials]]
 How do I specify my credentials when pushing over HTTP?::
@@ -185,6 +185,45 @@  Then, you can adjust your push URL to use `git@example_author` or
 `git@example_committer` instead of `git@example.org` (e.g., `git remote set-url
 git@example_author:org1/project1.git`).
 
+[[sync-working-tree]]
+How do I sync a working tree across systems?::
+	First, decide whether you want to do this at all.  Git usually works better
+	when you push or pull your work using the typical `git push` and `git fetch`
+	commands and isn't designed to share a working tree across systems.  Doing so
+	can cause `git status` to need to re-read every file in the working tree.
+	Additionally, Git's security model does not permit sharing a working tree
+	across untrusted users, so it is only safe to sync a working tree if it will
+	only be used by a single user across all machines.
++
+Therefore, it's better to push your work to either the other system or a central
+server using the normal push and pull mechanism.  However, this doesn't always
+preserve important data, like stashes, so some people prefer to share a working
+tree across systems.
++
+It is important not to use a cloud syncing service to sync any portion of a Git
+repository, since this can cause corruption, such as missing objects, changed
+or added files, broken refs, and a wide variety of other corruption.  These
+services tend to sync file by file and don't understand the structure of a Git
+repository.  This is especially bad if they sync the repository in the middle of
+it being updated, since that is very likely to cause incomplete or partial
+updates and therefore data loss.
++
+The recommended approach is to use `rsync -a --delete-after` (ideally with an
+encrypted connection such as with `ssh`) on the root of the repository.  You should
+ensure several things when you do this:
++
+* There are no additional worktrees enabled for your repository.
+* You are not using a separate Git directory outside of your repository root.
+* You are comfortable with the destination directory being an exact copy of the
+	source directory, _deleting any data that is already there_.
+* The repository is in a quiescent state for the duration of the transfer (that
+	is, no operations of any sort are taking place on it, including background
+	operations like `git gc`).
++
+Be aware that even with these recommendations, syncing in this way is
+potentially risky since it bypasses Git's normal integrity checking for
+repositories, so having backups is advised.
+
 Common Issues
 -------------
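
The checklist in the patch can be partly automated.  What follows is a minimal
sketch, assuming a POSIX shell with `git` and `rsync` installed; the function
names are hypothetical.  It checks only the first two items on the list.  It
cannot verify that the repository is quiescent, so that remains the user's
responsibility.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of Git or the patch.

# Succeeds only if the repository at "$1" keeps its Git directory at
# "$1/.git" and has no additional worktrees.
repo_is_self_contained() {
	repo=$1
	# With a separate Git directory (or in a linked worktree), .git is a
	# file rather than a directory.
	[ -d "$repo/.git" ] || return 1
	# --porcelain prints one "worktree <path>" line per worktree, including
	# the main one, so a count above 1 means extra worktrees exist.
	count=$(git -C "$repo" worktree list --porcelain | grep -c '^worktree ')
	[ "$count" -eq 1 ]
}

# Makes "$2" an exact copy of "$1", DELETING anything at the destination
# that is not present in the source.
sync_repo() {
	src=$1
	dest=$2
	if ! repo_is_self_contained "$src"; then
		echo "refusing to sync: $src is not self-contained" >&2
		return 1
	fi
	# Trailing slashes copy the directory's contents, not the directory itself.
	rsync -a --delete-after "$src/" "$dest/"
}
```

For example, `sync_repo ~/src/project otherhost:src/project` (hypothetical
paths) would mirror the repository to another machine; current rsync releases
use `ssh` as the default remote shell for such destinations.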
