[4/4] doc: add a FAQ entry about syncing working trees

Message ID: 20211020010624.675562-5-sandals@crustytoothpaste.net
State: Superseded
Return-Path: <git-owner@kernel.org>
From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: <git@vger.kernel.org>
Cc: Jeff King <peff@peff.net>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH 4/4] doc: add a FAQ entry about syncing working trees
Date: Wed, 20 Oct 2021 01:06:23 +0000
In-Reply-To: <20211020010624.675562-1-sandals@crustytoothpaste.net>
References: <20211020010624.675562-1-sandals@crustytoothpaste.net>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series: Additional FAQ entries
 [0/4] Additional FAQ entries
 [1/4] gitfaq: add advice on monorepos
 [2/4] gitfaq: add documentation on proxies
 [3/4] gitfaq: give advice on using eol attribute in gitattributes
 [4/4] doc: add a FAQ entry about syncing working trees

brian m. carlson Oct. 20, 2021, 1:06 a.m. UTC

Users very commonly want to sync their working tree across machines,
often to carry across in-progress work or stashes.  Despite this not
being a recommended approach, users want to do it and are not dissuaded
by suggestions not to, so let's recommend a sensible technique.

The technique that many users are using is their preferred cloud syncing
service, which is a bad idea.  Users have reported problems where they
end up with duplicate files that won't go away (with names like "file.c
2"), broken references, oddly named references that have date stamps
appended to them, missing objects, and general corruption and data loss.
That's because almost all of these tools sync file by file, which is a
great technique if your project is a single word processing document or
spreadsheet, but is utterly abysmal for Git repositories because they
don't necessarily snapshot the entire repository correctly.  They also
tend to sync the files immediately instead of when the repository is
quiescent, so writing multiple files, as occurs during a commit or a gc,
can confuse the tools and lead to corruption.

We know that the old standby, rsync, is up to the task, provided that
the repository is quiescent, so let's suggest that and dissuade people
from using cloud syncing tools.  Let's tell people about common things
they should be aware of before doing this and that this is still
potentially risky.  Additionally, let's tell people that Git's security
model does not permit sharing working trees across users in case they
planned to do that.  While we'd still prefer users didn't try to do
this, hopefully this will lead them in a safer direction.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitfaq.txt | 43 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

Bagas Sanjaya Oct. 20, 2021, 4:58 a.m. UTC | #1

On 20/10/21 08.06, brian m. carlson wrote:
> +Be aware that even with these recommendations, syncing in this way is
> +potentially risky since it bypasses Git's normal integrity checking for
> +repositories, so having backups is advised.
> +

This raises a question: how can users back up their repos? The answer is
the same as in this FAQ entry for syncing working trees: push the repo to
a spare server at your disposal, or use rsync (as long as the
preconditions are met).

Philip Oakley Oct. 20, 2021, 2:05 p.m. UTC | #2

On 20/10/2021 05:58, Bagas Sanjaya wrote:
> On 20/10/21 08.06, brian m. carlson wrote:
>> +Be aware that even with these recommendations, syncing in this way is
>> +potentially risky since it bypasses Git's normal integrity checking for
>> +repositories, so having backups is advised.
>> +
>
> This raises a question: how can users back up their repos? The answer is
> the same as in this FAQ entry for syncing working trees: push the repo to
> a spare server at your disposal, or use rsync (as long as the
> preconditions are met).
>
Perhaps, for backing up a repo, users might be pointed toward `git
bundle` (with caveats about `--all` being overcomplete).
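
As a rough illustration of the `git bundle` suggestion, here is a minimal
sketch. The function name `backup_repo` and its arguments are hypothetical and
not part of the patch; pass an absolute path for the bundle, since `git -C`
changes directory before the path is resolved.

```shell
# Hypothetical helper: back up every ref of a repository into one bundle file.
# Usage: backup_repo /path/to/repo /absolute/path/to/backup.bundle
backup_repo() {
    repo=$1
    bundle=$2
    # --all bundles every ref (branches, tags, remote-tracking refs, ...),
    # which may be more than you want; list specific refs to narrow it down.
    git -C "$repo" bundle create "$bundle" --all &&
    # Check that the bundle is valid and applies to this repository.
    git -C "$repo" bundle verify "$bundle"
}
```

A bundle made this way can later be restored with `git clone` pointed at the
bundle file.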

Ævar Arnfjörð Bjarmason Oct. 20, 2021, 11:35 p.m. UTC | #3

On Wed, Oct 20 2021, brian m. carlson wrote:

> +The recommended approach is to use `rsync -a --delete-after` (ideally with an
> +encrypted connection such as with `ssh`) on the root of repository.  You should
> +ensure several things when you do this:

What's the reason to recommend --delete-after in particular? I realize
that in the .git directory not using *a* delete option *will* cause
corruption, e.g. you can leave behind stale loose refs alongside an
up-to-date packed-refs file.

But isn't that equally covered by --delete and --delete-before? I'm not
very well versed in rsync, but aren't they all equivalent as far as the
end state goes?

If the intention with --delete-after over --delete or --delete-before is
to somehow make the repository useful during the transfer, doesn't that
go against the later advice of:

> +* The repository is in a quiescent state for the duration of the transfer (that
> +	is, no operations of any sort are taking place on it, including background
> +	operations like `git gc`).

Also for this:

> +Be aware that even with these recommendations, syncing in this way is
> +potentially risky since it bypasses Git's normal integrity checking for
> +repositories, so having backups is advised.

Perhaps we should recommend running "git gc" or another integrity check
(such as "git fsck") afterwards, although those don't cover some cases,
e.g. the packed-refs vs. loose refs problem in the case of a missing
--delete-whatever.
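
A minimal sketch of such a post-sync check follows. The function name
`check_repo` is hypothetical, and as noted above a clean result does not rule
out stale refs left behind by a sync without a delete option.

```shell
# Hypothetical helper: run Git's integrity check on a synced repository.
# Usage: check_repo /path/to/repo
check_repo() {
    # fsck checks object validity and connectivity and exits non-zero
    # when it finds errors such as missing or corrupt objects.
    git -C "$1" fsck --full
}
```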

brian m. carlson Oct. 21, 2021, 12:03 a.m. UTC | #4

On 2021-10-20 at 23:35:43, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 20 2021, brian m. carlson wrote:
> 
> > +The recommended approach is to use `rsync -a --delete-after` (ideally with an
> > +encrypted connection such as with `ssh`) on the root of repository.  You should
> > +ensure several things when you do this:
> 
> What's the reason to recommend --delete-after in particular? I realize
> that in the .git directory not using *a* delete option *will* cause
> corruption, e.g. you can leave behind stale loose refs alongside an
> up-to-date packed-refs file.
> 
> But isn't that equally covered by --delete and --delete-before? I'm not
> very well versed in rsync, but aren't they all equivalent as far as the
> end state goes?

Yes.  The goal is that if something goes wrong, you have all the objects
you did before, even if you have some potentially invalid refs.  The
goal is to make it a little less risky if you interrupt it with a Ctrl-C
because you realize the destination contained data you wanted.  I always
prefer --delete-after for that reason, assuming the destination has
sufficient disk space.

It shouldn't make a difference in a successful end state, however.
> 
> > +Be aware that even with these recommendations, syncing in this way is
> > +potentially risky since it bypasses Git's normal integrity checking for
> > +repositories, so having backups is advised.
> 
> Perhaps we should recommend running "git gc" or another integrity check
> (such as "git fsck") afterwards, although those don't cover some cases,
> e.g. the packed-refs vs. loose refs problem in the case of a missing
> --delete-whatever.

I can recommend something like that.

Ævar Arnfjörð Bjarmason Oct. 21, 2021, 12:33 a.m. UTC | #5

On Thu, Oct 21 2021, brian m. carlson wrote:

> On 2021-10-20 at 23:35:43, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Oct 20 2021, brian m. carlson wrote:
>> 
>> > +The recommended approach is to use `rsync -a --delete-after` (ideally with an
>> > +encrypted connection such as with `ssh`) on the root of repository.  You should
>> > +ensure several things when you do this:
>> 
>> What's the reason to recommend --delete-after in particular? I realize
>> that in the .git directory not using *a* delete option *will* cause
>> corruption, e.g. you can leave behind stale loose refs alongside an
>> up-to-date packed-refs file.
>> 
>> But isn't that equally covered by --delete and --delete-before? I'm not
>> very well versed in rsync, but aren't they all equivalent as far as the
>> end state goes?
>
> Yes.  The goal is that if something goes wrong, you have all the objects
> you did before, even if you have some potentially invalid refs.  The
> goal is to make it a little less risky if you interrupt it with a Ctrl-C
> because you realize the destination contained data you wanted.  I always
> prefer --delete-after for that reason, assuming the destination has
> sufficient disk space.

Isn't it preferable to recommend --delete-before for that reason?
I.e. --delete-after will produce subtle corruption of e.g. refs
potentially pointing to the wrong thing.

But if you recommend --delete-before I think (but maybe I'm missing some
cases) that it will be more likely to produce obvious corruption,
e.g. git dying due to missing objects.

Anyway, I'm also happy to just leave this as-is, it just stood out to me
as odd.

> It shouldn't make a difference in a successful end state, however.
>> 
>> > +Be aware that even with these recommendations, syncing in this way is
>> > +potentially risky since it bypasses Git's normal integrity checking for
>> > +repositories, so having backups is advised.
>> 
>> Perhaps we should recommend running "git gc" or another integrity check
>> (such as "git fsck") afterwards, although those don't cover some cases,
>> e.g. the packed-refs vs. loose refs problem in the case of a missing
>> --delete-whatever.
>
> I can recommend something like that.

...or just leaving it as-is is also fine with me, whatever you think is
best.
