diff mbox series

[1/4] docs: add a question on syncing repositories to the FAQ

Message ID 20210227191813.96148-2-sandals@crustytoothpaste.net (mailing list archive)
State New, archived
Headers show
Series Documentation updates to FAQ and git-archive | expand

Commit Message

brian m. carlson Feb. 27, 2021, 7:18 p.m. UTC
It is very common that users want to transport repositories with working
trees across machines.  While this is not recommended, many users do it
anyway and moreover, do it using cloud syncing services, which often
corrupt their data.  The results of such are often seen in tales of woe
on common user question fora.

Let's tell users what we recommend they do in this circumstance and how
to do it safely.  Warn them about the dangers of untrusted working trees
and the downsides of index refreshes, as well as the problems with cloud
syncing services.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitfaq.txt | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

Comments

Ævar Arnfjörð Bjarmason Feb. 28, 2021, 1:01 p.m. UTC | #1
On Sat, Feb 27 2021, brian m. carlson wrote:

> It is very common that users want to transport repositories with working
> trees across machines.  While this is not recommended, many users do it
> anyway and moreover, do it using cloud syncing services, which often
> corrupt their data.  The results of such are often seen in tales of woe
> on common user question fora.
>
> Let's tell users what we recommend they do in this circumstance and how
> to do it safely.  Warn them about the dangers of untrusted working trees
> and the downsides of index refreshes, as well as the problems with cloud
> syncing services.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  Documentation/gitfaq.txt | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
> index afdaeab850..042b11e88a 100644
> --- a/Documentation/gitfaq.txt
> +++ b/Documentation/gitfaq.txt
> @@ -241,6 +241,45 @@ How do I know if I want to do a fetch or a pull?::
>  	ignore the upstream changes.  A pull consists of a fetch followed
>  	immediately by either a merge or rebase.  See linkgit:git-pull[1].
>  
> +[[syncing-across-computers]]
> +How do I sync a Git repository across multiple computers, VMs, or operating systems?::
> +	The best way to sync a repository across computers is by pushing and fetching.
> +	This uses the native Git mechanisms to transport data efficiently and is the
> +	easiest and best way to move data across machines.  If the machines aren't
> +	connected by a network, you can use `git bundle` to create a file with your
> +	changes and then fetch or pull them from the file on the remote machine.
> +	Pushing and fetching are also the only secure ways to interact with a
> +	repository you don't own or trust.
> ++
> +However, sometimes people want to sync a repository with a working tree across
> +machines.  While this isn't recommended, it can be done with `rsync` (usually
> +over an SSH connection), but only when the repository is completely idle (that
> +is, no processes, including `git gc`, are modifying it at all).  If `rsync`
> +isn't available, you can use `tar` to create a tar archive of the repository and
> +copy it to another machine.  Zip files shouldn't be used due to their poor
> +support for permissions and symbolic links.
> ++
> +You may also use a shared file system between the two machines that is POSIX
> +compliant, such as SSHFS (SFTP) or NFSv4.  If you are using SFTP for this
> +purpose, the server should support fsync and POSIX renames (OpenSSH does).  File
> +systems that don't provide POSIX semantics, such as DAV mounts, shouldn't be
> +used.
> ++
> +Note that you must not work with untrusted working trees, since it's trivial
> +for an attacker to set configuration options that will cause arbitrary code to
> +be executed on your machine.  Also, in almost all cases when sharing a working
> +tree across machines, Git will need to re-read all files the next time you run
> +`git status` or otherwise refresh the index, which can be slow.  This generally
> +can't be avoided and is part of the reason why sharing a working tree isn't
> +recommended.
> ++
> +In no circumstances should you share a working tree or bare repository using a
> +cloud syncing service or store it in a directory managed by such a service.
> +Such services sync file by file and don't maintain the invariants required for
> +repository integrity; in addition, they can cause files to be added, removed, or
> +duplicated unexpectedly.  If you must use one of these services, use it to store
> +the repository in a tar archive instead.

I think documentation on this topic is needed, but wonder if we couldn't
make this more understandable by going to the heart of the matter, i.e.:

 * We prefer push/pull/bundle to copy/replicate .git content

 * Regardless, a .git directory can be copied across systems just fine
   if you recursively guarantee snapshot integrity, e.g. it doesn't
   depend on the endian-ness of the OS, or has anything like symlinks in
   there made by git itself.

 * Anything which copies .git data on a concurrently updated repo can
   lead to corruption, whether that's cp -R, rsync with any combination
   of flags, some cloud syncing service that expects to present that
   tree to two computers without guaranteeing POSIX fs semantics between
   the two etc.

 * A common pitfall with such copying of a .git directory is that file
   deletions are also critical, e.g. rsync without --delete is almost
   guaranteed to produce a corrupt .git if repeated enough times
   (e.g. git might prefer stale loose refs over now-packed ones).

 * It's OK to copy .git between system that differ in their support of
   symbolic links, but the work tree may be in an inconsistent state and
   need some manner of "git reset" to repair it.

And, not sure if this is correct:

 * It may be OK to edit a .git directory on a non-POSIX conforming fs
   (but perhaps validate the result with "git fsck"). But it's not OK to
   have two writing git processes work on such a repository at the same
   time. Keep in mind that certain operations and default settings (such
   as background gc, see `gc.autoDetach` in linkgit:git-config[1]) might
   result in two processes working on the directory even if you're
   changing it only in one terminal window at a time.

I.e. to go a bit beyond the docs you have of basically saying "there be
dragons in non-POSIX" and describe the particular scenarios where it can
go wrong. Something like the above still leaves the door open to users
using cloud syncing services, which they can then judge for themselves
as being OK or not. I'm sure there's some that are far from POSIX
compliance that are OK in practice if the above warnings are observed.
brian m. carlson March 15, 2021, 8:40 p.m. UTC | #2
On 2021-02-28 at 13:01:04, Ævar Arnfjörð Bjarmason wrote:
> make this more understandable by going to the heart of the matter, i.e.:
> 
>  * We prefer push/pull/bundle to copy/replicate .git content
> 
>  * Regardless, a .git directory can be copied across systems just fine
>    if you recursively guarantee snapshot integrity, e.g. it doesn't
>    depend on the endian-ness of the OS, or has anything like symlinks in
>    there made by git itself.

I'll revise to make this clearer.

>  * Anything which copies .git data on a concurrently updated repo can
>    lead to corruption, whether that's cp -R, rsync with any combination
>    of flags, some cloud syncing service that expects to present that
>    tree to two computers without guaranteeing POSIX fs semantics between
>    the two etc.
> 
>  * A common pitfall with such copying of a .git directory is that file
>    deletions are also critical, e.g. rsync without --delete is almost
>    guaranteed to produce a corrupt .git if repeated enough times
>    (e.g. git might prefer stale loose refs over now-packed ones).

I'll include that as well.

>  * It's OK to copy .git between system that differ in their support of
>    symbolic links, but the work tree may be in an inconsistent state and
>    need some manner of "git reset" to repair it.

And this.

> And, not sure if this is correct:
> 
>  * It may be OK to edit a .git directory on a non-POSIX conforming fs
>    (but perhaps validate the result with "git fsck"). But it's not OK to
>    have two writing git processes work on such a repository at the same
>    time. Keep in mind that certain operations and default settings (such
>    as background gc, see `gc.autoDetach` in linkgit:git-config[1]) might
>    result in two processes working on the directory even if you're
>    changing it only in one terminal window at a time.

I'll reflect the fact that only one process may modify the repository at
a time.

> I.e. to go a bit beyond the docs you have of basically saying "there be
> dragons in non-POSIX" and describe the particular scenarios where it can
> go wrong. Something like the above still leaves the door open to users
> using cloud syncing services, which they can then judge for themselves
> as being OK or not. I'm sure there's some that are far from POSIX
> compliance that are OK in practice if the above warnings are observed.

Unfortunately, it's a bit hard to make a concise entry that explains all
the different ways that using a non-POSIX compliant file system can
break things.  We've seen NFS where open(2) is broken, both with
permissions and O_EXCL; cloud syncing services that restore files that
have been deleted; DAV mounts that don't support the necessary
semantics; systems where O_APPEND doesn't have POSIX semantics; and a
whole host of other sadness.  I don't therefore think that we want to
tell people that we think that using a file system that doesn't support
POSIX semantics is okay, because in most cases, they are not.

Users frequently try to judge for themselves what works and then they
try one of the above things which clearly does not, so saying, "Try it
and see if it breaks," just tends to result in users complaining about
repository corruption later on.  I'm really tired of answering these
same questions again and again and telling users that their repositories
are hosed and that they've lost data, so I want to be definitive that we
don't recommend or support these various broken environments and that
users should not use them.  It may be in rare cases that users have
extensive knowledge about the behavior of their file systems and Git's
requirements and can make a sound judgment to use a non-POSIX file
system, but the people on the planet who can do that effectively are
almost all Git developers.  I will state that "File systems that don't
provide POSIX semantics, such as DAV mounts, shouldn't be used without
fully understanding the situation and requirements," which I think is
the most generous recommendation we can safely give.
diff mbox series

Patch

diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
index afdaeab850..042b11e88a 100644
--- a/Documentation/gitfaq.txt
+++ b/Documentation/gitfaq.txt
@@ -241,6 +241,45 @@  How do I know if I want to do a fetch or a pull?::
 	ignore the upstream changes.  A pull consists of a fetch followed
 	immediately by either a merge or rebase.  See linkgit:git-pull[1].
 
+[[syncing-across-computers]]
+How do I sync a Git repository across multiple computers, VMs, or operating systems?::
+	The best way to sync a repository across computers is by pushing and fetching.
+	This uses the native Git mechanisms to transport data efficiently and is the
+	easiest and best way to move data across machines.  If the machines aren't
+	connected by a network, you can use `git bundle` to create a file with your
+	changes and then fetch or pull them from the file on the remote machine.
+	Pushing and fetching are also the only secure ways to interact with a
+	repository you don't own or trust.
++
+However, sometimes people want to sync a repository with a working tree across
+machines.  While this isn't recommended, it can be done with `rsync` (usually
+over an SSH connection), but only when the repository is completely idle (that
+is, no processes, including `git gc`, are modifying it at all).  If `rsync`
+isn't available, you can use `tar` to create a tar archive of the repository and
+copy it to another machine.  Zip files shouldn't be used due to their poor
+support for permissions and symbolic links.
++
+You may also use a shared file system between the two machines that is POSIX
+compliant, such as SSHFS (SFTP) or NFSv4.  If you are using SFTP for this
+purpose, the server should support fsync and POSIX renames (OpenSSH does).  File
+systems that don't provide POSIX semantics, such as DAV mounts, shouldn't be
+used.
++
+Note that you must not work with untrusted working trees, since it's trivial
+for an attacker to set configuration options that will cause arbitrary code to
+be executed on your machine.  Also, in almost all cases when sharing a working
+tree across machines, Git will need to re-read all files the next time you run
+`git status` or otherwise refresh the index, which can be slow.  This generally
+can't be avoided and is part of the reason why sharing a working tree isn't
+recommended.
++
+In no circumstances should you share a working tree or bare repository using a
+cloud syncing service or store it in a directory managed by such a service.
+Such services sync file by file and don't maintain the invariants required for
+repository integrity; in addition, they can cause files to be added, removed, or
+duplicated unexpectedly.  If you must use one of these services, use it to store
+the repository in a tar archive instead.
+
 Merging and Rebasing
 --------------------