[GSoC,v4,2/7] clone: better handle symlinked files at .git/objects/
diff mbox series

Message ID 20190322232237.13293-3-matheus.bernardino@usp.br
State New
Headers show
Series
  • clone: dir-iterator refactoring with tests
Related show

Commit Message

Matheus Tavares Bernardino March 22, 2019, 11:22 p.m. UTC
There is currently an odd behaviour when locally clonning a repository
with symlinks at .git/objects: using --no-hardlinks all symlinks are
dereferenced but without it Git will try to hardlink the files with the
link() function, which has an OS-specific behaviour on symlinks. On OSX
and NetBSD, it creates a hardlink to the file pointed by the symlink
whilst on GNU/Linux, it creates a hardlink to the symlink itself.

On Manjaro GNU/Linux:
    $ touch a
    $ ln -s a b
    $ link b c
    $ ls -li a b c
    155 [...] a
    156 [...] b -> a
    156 [...] c -> a

But on NetBSD:
    $ ls -li a b c
    2609160 [...] a
    2609164 [...] b -> a
    2609160 [...] c

It's not good to have the result of a local clone to be OS-dependent and
since the behaviour on GNU/Linux may result in broken symlinks, let's
re-implement it with linkat() instead of link() using a flag to always
follow symlinks and make the hardlink be to the pointed file. With this,
besides standardizing the behaviour, no broken symlinks will be
produced. Also, add tests for symlinked files at .git/objects/.

Note: Git won't create symlinks at .git/objects itself, but it's better
to handle this case and be friendly with users who manually create them.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Co-authored-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/clone.c            |  2 +-
 t/t5604-clone-reference.sh | 26 +++++++++++++++++++-------
 2 files changed, 20 insertions(+), 8 deletions(-)

Comments

Thomas Gummerer March 28, 2019, 10:10 p.m. UTC | #1
On 03/22, Matheus Tavares wrote:
> There is currently an odd behaviour when locally clonning a repository
> with symlinks at .git/objects: using --no-hardlinks all symlinks are
> dereferenced but without it Git will try to hardlink the files with the
> link() function, which has an OS-specific behaviour on symlinks. On OSX
> and NetBSD, it creates a hardlink to the file pointed by the symlink
> whilst on GNU/Linux, it creates a hardlink to the symlink itself.
> 
> On Manjaro GNU/Linux:
>     $ touch a
>     $ ln -s a b
>     $ link b c
>     $ ls -li a b c
>     155 [...] a
>     156 [...] b -> a
>     156 [...] c -> a
> 
> But on NetBSD:
>     $ ls -li a b c
>     2609160 [...] a
>     2609164 [...] b -> a
>     2609160 [...] c
> 
> It's not good to have the result of a local clone to be OS-dependent and
> since the behaviour on GNU/Linux may result in broken symlinks, let's
> re-implement it with linkat() instead of link() using a flag to always
> follow symlinks and make the hardlink be to the pointed file. With this,
> besides standardizing the behaviour, no broken symlinks will be
> produced. Also, add tests for symlinked files at .git/objects/.
> 
> Note: Git won't create symlinks at .git/objects itself, but it's better
> to handle this case and be friendly with users who manually create them.
> 
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> Co-authored-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  builtin/clone.c            |  2 +-
>  t/t5604-clone-reference.sh | 26 +++++++++++++++++++-------
>  2 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/builtin/clone.c b/builtin/clone.c
> index 50bde99618..b76f33c635 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -443,7 +443,7 @@ static void copy_or_link_directory(struct strbuf *src, struct strbuf *dest,
>  		if (unlink(dest->buf) && errno != ENOENT)
>  			die_errno(_("failed to unlink '%s'"), dest->buf);
>  		if (!option_no_hardlinks) {
> -			if (!link(src->buf, dest->buf))
> +			if (!linkat(AT_FDCWD, src->buf, AT_FDCWD, dest->buf, AT_SYMLINK_FOLLOW))

This line is starting to get a bit long, might be worth breaking it up
to keep to 80 characters per line.

I notice that we are currently not using 'linkat()' anywhere else in
our codebase.  It looks like it has been introduced in POSIX.1-2008,
which sounds fairly recent by git's standards.  So I wonder if this is
really supported on all platforms that git is being built on.

I also wonder what would need to be done on Windows if we were to
introduce this.  I see we define the 'link()' function in
'compat/mingw.c' for that currently, so I guess something similar
would be needed for 'linkat()'.  I added Dscho to Cc for Windows
expertise.

While I agree with the goal of consistency accross all platforms here,
I don't know if it's actually worth going through the pain of doing
that, especially for somewhat of an edge case in local clones.

If the test in the previous patch passes on all platforms, I'd be okay
with just calling the behaviour here undefined, especially as git
would never actually create symlinks in the .git/objects directory.
Ævar Arnfjörð Bjarmason March 29, 2019, 8:38 a.m. UTC | #2
On Thu, Mar 28 2019, Thomas Gummerer wrote:

> On 03/22, Matheus Tavares wrote:
>> There is currently an odd behaviour when locally clonning a repository
>> with symlinks at .git/objects: using --no-hardlinks all symlinks are
>> dereferenced but without it Git will try to hardlink the files with the
>> link() function, which has an OS-specific behaviour on symlinks. On OSX
>> and NetBSD, it creates a hardlink to the file pointed by the symlink
>> whilst on GNU/Linux, it creates a hardlink to the symlink itself.
>>
>> On Manjaro GNU/Linux:
>>     $ touch a
>>     $ ln -s a b
>>     $ link b c
>>     $ ls -li a b c
>>     155 [...] a
>>     156 [...] b -> a
>>     156 [...] c -> a
>>
>> But on NetBSD:
>>     $ ls -li a b c
>>     2609160 [...] a
>>     2609164 [...] b -> a
>>     2609160 [...] c
>>
>> It's not good to have the result of a local clone to be OS-dependent and
>> since the behaviour on GNU/Linux may result in broken symlinks, let's
>> re-implement it with linkat() instead of link() using a flag to always
>> follow symlinks and make the hardlink be to the pointed file. With this,
>> besides standardizing the behaviour, no broken symlinks will be
>> produced. Also, add tests for symlinked files at .git/objects/.
>>
>> Note: Git won't create symlinks at .git/objects itself, but it's better
>> to handle this case and be friendly with users who manually create them.
>>
>> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> Co-authored-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>  builtin/clone.c            |  2 +-
>>  t/t5604-clone-reference.sh | 26 +++++++++++++++++++-------
>>  2 files changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/builtin/clone.c b/builtin/clone.c
>> index 50bde99618..b76f33c635 100644
>> --- a/builtin/clone.c
>> +++ b/builtin/clone.c
>> @@ -443,7 +443,7 @@ static void copy_or_link_directory(struct strbuf *src, struct strbuf *dest,
>>  		if (unlink(dest->buf) && errno != ENOENT)
>>  			die_errno(_("failed to unlink '%s'"), dest->buf);
>>  		if (!option_no_hardlinks) {
>> -			if (!link(src->buf, dest->buf))
>> +			if (!linkat(AT_FDCWD, src->buf, AT_FDCWD, dest->buf, AT_SYMLINK_FOLLOW))
>
> This line is starting to get a bit long, might be worth breaking it up
> to keep to 80 characters per line.
>
> I notice that we are currently not using 'linkat()' anywhere else in
> our codebase.  It looks like it has been introduced in POSIX.1-2008,
> which sounds fairly recent by git's standards.  So I wonder if this is
> really supported on all platforms that git is being built on.
>
> I also wonder what would need to be done on Windows if we were to
> introduce this.  I see we define the 'link()' function in
> 'compat/mingw.c' for that currently, so I guess something similar
> would be needed for 'linkat()'.  I added Dscho to Cc for Windows
> expertise.

For better of worse this particular quest started because I pointed out
(with some WIP patches) that for understanding this change we should
test whatever we did now, to ensure that the refactoring didn't have
unintended side-effects.

But that's a separate question from whether or not we want to keep the
current behavior.

I think the current behavior is clearly insane, so I think we should
change it with some follow-up patches. In particular options like
--dissociate should clearly (in my mind at least) have behavior similar
to "cp -L", and --local should hardlink to the *target* of the symlink,
if anything, at least for objects/{??,pack,info}

I think that changes the portability story with linkat(), since it's not
something we should be planning to keep, just an intermediate step so we
don't have a gigantic patch that both adds tests, refactors and changes
the behavior.

> While I agree with the goal of consistency accross all platforms here,
> I don't know if it's actually worth going through the pain of doing
> that, especially for somewhat of an edge case in local clones.

Note that we explicitly clone everything under objects/, including
recursively cloning unknown directories and their files.

So this is not just say about how we handle symlinks that we don't
expect now (nothing uses them), but if we want to make the promise that
nothing in objects/ will ever use symlinks. Or more specifically, that
if a new version of git starts using it that something doing local
clones might produce a broken copy of such a repo.

Maybe we'll still say "we don't care". Just saying it's a slightly
different question...

> If the test in the previous patch passes on all platforms, I'd be okay
> with just calling the behaviour here undefined, especially as git
> would never actually create symlinks in the .git/objects directory.
Matheus Tavares Bernardino March 29, 2019, 2:27 p.m. UTC | #3
On Thu, Mar 28, 2019 at 7:10 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
> On 03/22, Matheus Tavares wrote:
> > There is currently an odd behaviour when locally clonning a repository
> > with symlinks at .git/objects: using --no-hardlinks all symlinks are
> > dereferenced but without it Git will try to hardlink the files with the
> > link() function, which has an OS-specific behaviour on symlinks. On OSX
> > and NetBSD, it creates a hardlink to the file pointed by the symlink
> > whilst on GNU/Linux, it creates a hardlink to the symlink itself.
> >
> > On Manjaro GNU/Linux:
> >     $ touch a
> >     $ ln -s a b
> >     $ link b c
> >     $ ls -li a b c
> >     155 [...] a
> >     156 [...] b -> a
> >     156 [...] c -> a
> >
> > But on NetBSD:
> >     $ ls -li a b c
> >     2609160 [...] a
> >     2609164 [...] b -> a
> >     2609160 [...] c
> >
> > It's not good to have the result of a local clone to be OS-dependent and
> > since the behaviour on GNU/Linux may result in broken symlinks, let's
> > re-implement it with linkat() instead of link() using a flag to always
> > follow symlinks and make the hardlink be to the pointed file. With this,
> > besides standardizing the behaviour, no broken symlinks will be
> > produced. Also, add tests for symlinked files at .git/objects/.
> >
> > Note: Git won't create symlinks at .git/objects itself, but it's better
> > to handle this case and be friendly with users who manually create them.
> >
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> > Co-authored-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> > ---
> >  builtin/clone.c            |  2 +-
> >  t/t5604-clone-reference.sh | 26 +++++++++++++++++++-------
> >  2 files changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/builtin/clone.c b/builtin/clone.c
> > index 50bde99618..b76f33c635 100644
> > --- a/builtin/clone.c
> > +++ b/builtin/clone.c
> > @@ -443,7 +443,7 @@ static void copy_or_link_directory(struct strbuf *src, struct strbuf *dest,
> >               if (unlink(dest->buf) && errno != ENOENT)
> >                       die_errno(_("failed to unlink '%s'"), dest->buf);
> >               if (!option_no_hardlinks) {
> > -                     if (!link(src->buf, dest->buf))
> > +                     if (!linkat(AT_FDCWD, src->buf, AT_FDCWD, dest->buf, AT_SYMLINK_FOLLOW))
>
> This line is starting to get a bit long, might be worth breaking it up
> to keep to 80 characters per line.
>
> I notice that we are currently not using 'linkat()' anywhere else in
> our codebase.  It looks like it has been introduced in POSIX.1-2008,
> which sounds fairly recent by git's standards.  So I wonder if this is
> really supported on all platforms that git is being built on.
>
> I also wonder what would need to be done on Windows if we were to
> introduce this.  I see we define the 'link()' function in
> 'compat/mingw.c' for that currently, so I guess something similar
> would be needed for 'linkat()'.  I added Dscho to Cc for Windows
> expertise.

Ok, what if instead of using linkat() we use 'realpath(const char
*path, char *resolved_path)', which will resolve any symlinks at
'path' and store the canonical path at 'resolved_path'? Then, we can
still keep using link() but now, with the certainty that all platforms
will have a consistent behaviour? (also, realpath() is POSIX.1-2001)
Would that be a better idea?

> While I agree with the goal of consistency accross all platforms here,
> I don't know if it's actually worth going through the pain of doing
> that, especially for somewhat of an edge case in local clones.
>
> If the test in the previous patch passes on all platforms, I'd be okay
> with just calling the behaviour here undefined, especially as git
> would never actually create symlinks in the .git/objects directory.
Johannes Schindelin March 29, 2019, 3:40 p.m. UTC | #4
Hi Thomas,

On Thu, 28 Mar 2019, Thomas Gummerer wrote:

> On 03/22, Matheus Tavares wrote:
> >
> > diff --git a/builtin/clone.c b/builtin/clone.c
> > index 50bde99618..b76f33c635 100644
> > --- a/builtin/clone.c
> > +++ b/builtin/clone.c
> > @@ -443,7 +443,7 @@ static void copy_or_link_directory(struct strbuf *src, struct strbuf *dest,
> >  		if (unlink(dest->buf) && errno != ENOENT)
> >  			die_errno(_("failed to unlink '%s'"), dest->buf);
> >  		if (!option_no_hardlinks) {
> > -			if (!link(src->buf, dest->buf))
> > +			if (!linkat(AT_FDCWD, src->buf, AT_FDCWD, dest->buf, AT_SYMLINK_FOLLOW))
>
> [...]
>
> I notice that we are currently not using 'linkat()' anywhere else in
> our codebase.  It looks like it has been introduced in POSIX.1-2008,
> which sounds fairly recent by git's standards.  So I wonder if this is
> really supported on all platforms that git is being built on.

I bet you it isn't.

> I also wonder what would need to be done on Windows if we were to
> introduce this.  I see we define the 'link()' function in
> 'compat/mingw.c' for that currently, so I guess something similar
> would be needed for 'linkat()'.  I added Dscho to Cc for Windows
> expertise.

Indeed, `linkat()` would have to be implemented in `compat/mingw.c`. It
would be a bit involved because the last parameter of that function
changes behavior noticeably, but the main difficulty (to determine the
path from a file descriptor) should be overcome using
`HANDLE olddirhandle = _get_osfhandle(olddirfd);` and the calling
`GetFinalPathNameByHandleW(olddirhandle, wbuf, sizeof(wbuf));`.

So yes, this is *not* something I'd do lightly.

The bigger problem will be to continue to support older Unices such as
SunOS and AIX. I highly doubt that they have that function. You should
find out, Matheus.

Ciao,
Johannes
Thomas Gummerer March 29, 2019, 8:05 p.m. UTC | #5
On 03/29, Matheus Tavares Bernardino wrote:
> On Thu, Mar 28, 2019 at 7:10 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > I notice that we are currently not using 'linkat()' anywhere else in
> > our codebase.  It looks like it has been introduced in POSIX.1-2008,
> > which sounds fairly recent by git's standards.  So I wonder if this is
> > really supported on all platforms that git is being built on.
> >
> > I also wonder what would need to be done on Windows if we were to
> > introduce this.  I see we define the 'link()' function in
> > 'compat/mingw.c' for that currently, so I guess something similar
> > would be needed for 'linkat()'.  I added Dscho to Cc for Windows
> > expertise.
> 
> Ok, what if instead of using linkat() we use 'realpath(const char
> *path, char *resolved_path)', which will resolve any symlinks at
> 'path' and store the canonical path at 'resolved_path'? Then, we can
> still keep using link() but now, with the certainty that all platforms
> will have a consistent behaviour? (also, realpath() is POSIX.1-2001)
> Would that be a better idea?

Yeah, I think that is a good idea.  Note that 'realpath()' itself is
not used anywhere in our codebase either, but there is
'strbuf_realpath()', that from reading the function documentation does
exactly what 'realpath()' would do.  So using 'strbuf_realpath()'
would probably be the right thing to do here.
Thomas Gummerer March 29, 2019, 8:15 p.m. UTC | #6
On 03/29, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Mar 28 2019, Thomas Gummerer wrote:
> > I notice that we are currently not using 'linkat()' anywhere else in
> > our codebase.  It looks like it has been introduced in POSIX.1-2008,
> > which sounds fairly recent by git's standards.  So I wonder if this is
> > really supported on all platforms that git is being built on.
> >
> > I also wonder what would need to be done on Windows if we were to
> > introduce this.  I see we define the 'link()' function in
> > 'compat/mingw.c' for that currently, so I guess something similar
> > would be needed for 'linkat()'.  I added Dscho to Cc for Windows
> > expertise.
> 
> For better of worse this particular quest started because I pointed out
> (with some WIP patches) that for understanding this change we should
> test whatever we did now, to ensure that the refactoring didn't have
> unintended side-effects.
> 
> But that's a separate question from whether or not we want to keep the
> current behavior.
> 
> I think the current behavior is clearly insane, so I think we should
> change it with some follow-up patches. In particular options like
> --dissociate should clearly (in my mind at least) have behavior similar
> to "cp -L", and --local should hardlink to the *target* of the symlink,
> if anything, at least for objects/{??,pack,info}

Right, I definitely agree with all of that.  Adding tests for the
current behaviour is definitely a good thing if we can do it in a sane
way.  And I also agree that the current behaviour is insane, and
should be fixed, but that may not want to be part of this patch
series.

> I think that changes the portability story with linkat(), since it's not
> something we should be planning to keep, just an intermediate step so we
> don't have a gigantic patch that both adds tests, refactors and changes
> the behavior.

Fair enough, but that also means that this patch series necessarily
has to introduce the changes in behaviour as well as switching clone
to use dir-iterator.  Of course we could say that the switch-over to
using dir-iterator could be done as a separate patch series, but that
seems a bit too much of a change in scope of this series.

Now I think Matheus has actually found a nice solution to this issue
using 'strbuf_readlink()', which gives us the same behaviour as using
'linkat()' in this patch would give us, so this might not be that big
an issue in the end.
Matheus Tavares Bernardino March 30, 2019, 5:32 a.m. UTC | #7
On Fri, Mar 29, 2019 at 5:05 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
> On 03/29, Matheus Tavares Bernardino wrote:
> > On Thu, Mar 28, 2019 at 7:10 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > > I notice that we are currently not using 'linkat()' anywhere else in
> > > our codebase.  It looks like it has been introduced in POSIX.1-2008,
> > > which sounds fairly recent by git's standards.  So I wonder if this is
> > > really supported on all platforms that git is being built on.
> > >
> > > I also wonder what would need to be done on Windows if we were to
> > > introduce this.  I see we define the 'link()' function in
> > > 'compat/mingw.c' for that currently, so I guess something similar
> > > would be needed for 'linkat()'.  I added Dscho to Cc for Windows
> > > expertise.
> >
> > Ok, what if instead of using linkat() we use 'realpath(const char
> > *path, char *resolved_path)', which will resolve any symlinks at
> > 'path' and store the canonical path at 'resolved_path'? Then, we can
> > still keep using link() but now, with the certainty that all platforms
> > will have a consistent behaviour? (also, realpath() is POSIX.1-2001)
> > Would that be a better idea?
>
> Yeah, I think that is a good idea.  Note that 'realpath()' itself is
> not used anywhere in our codebase either, but there is
> 'strbuf_realpath()', that from reading the function documentation does
> exactly what 'realpath()' would do.  So using 'strbuf_realpath()'
> would probably be the right thing to do here.

Thanks. While I was looking for realpath() at git codebase (before I
saw your email), I got a little confused: Besides strbuf_realpath() I
also found real_path(), real_path_if_valid() and real_pathdup(). All
these last three use strbuf_realpath() but they also initialize the
struct strbuf internally and just return a 'char *', which is much
convenient in some cases. What seems weird to me is that, whilst
real_pathdup() releases the internally initialized struct strubuf
(leaving just the returned string to be free'd by the user), the other
two don't. So, if struct strbuf change in the future to have more
dynamic allocated resources, these functions will also have to be
modified. Also, since real_pathdup() can already do what the other two
do, do you know if there is a reason to keep all of them?

One last question: I found some places which don't free the string
returned by, for example, real_path() (e.g., find_worktree() at
worktree.c). Would it be a valid/good patch (or patches) to add free()
calls in this places? (I'm currently trying to get more people here at
USP to contribute to git, and maybe this could be a nice first
contribution for them...)
Thomas Gummerer March 30, 2019, 7:27 p.m. UTC | #8
On 03/30, Matheus Tavares Bernardino wrote:
> On Fri, Mar 29, 2019 at 5:05 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
> >
> > On 03/29, Matheus Tavares Bernardino wrote:
> > > Ok, what if instead of using linkat() we use 'realpath(const char
> > > *path, char *resolved_path)', which will resolve any symlinks at
> > > 'path' and store the canonical path at 'resolved_path'? Then, we can
> > > still keep using link() but now, with the certainty that all platforms
> > > will have a consistent behaviour? (also, realpath() is POSIX.1-2001)
> > > Would that be a better idea?
> >
> > Yeah, I think that is a good idea.  Note that 'realpath()' itself is
> > not used anywhere in our codebase either, but there is
> > 'strbuf_realpath()', that from reading the function documentation does
> > exactly what 'realpath()' would do.  So using 'strbuf_realpath()'
> > would probably be the right thing to do here.
> 
> Thanks. While I was looking for realpath() at git codebase (before I
> saw your email), I got a little confused: Besides strbuf_realpath() I
> also found real_path(), real_path_if_valid() and real_pathdup(). All
> these last three use strbuf_realpath() but they also initialize the
> struct strbuf internally and just return a 'char *', which is much
> convenient in some cases.

Right, feel free to use whichever is most convenient for you, and
whichever works in the context.

>                            What seems weird to me is that, whilst
> real_pathdup() releases the internally initialized struct strubuf
> (leaving just the returned string to be free'd by the user), the other
> two don't. So, if struct strbuf change in the future to have more
> dynamic allocated resources, these functions will also have to be
> modified. Also, since real_pathdup() can already do what the other two
> do, do you know if there is a reason to keep all of them?

Right, '*dup()' functions usually leave the return value to be free'd
by the caller.  And while 'real_pathdup()' could do what the others do
already it also takes more effort to use it.  Users don't need to free
the return value from 'real_path()' to avoid a memory leak.  This
alone justifies its existence I think.

> One last question: I found some places which don't free the string
> returned by, for example, real_path() (e.g., find_worktree() at
> worktree.c). Would it be a valid/good patch (or patches) to add free()
> calls in this places? (I'm currently trying to get more people here at
> USP to contribute to git, and maybe this could be a nice first
> contribution for them...)

Trying to plug memory leaks in the codebase is definitely something
that I think is worthy of doing.  Sometimes it's not worth actually
free'ing the memory, for example just before the program exits, in
which case we can use the UNLEAK annotation.  It was introduced in
0e5bba53af ("add UNLEAK annotation for reducing leak false positives",
2017-09-08) if you want more background.

That said, the memory from 'real_path()' should actually not be
free'd.  The strbuf there has a static lifetime, so it is valid until
git exits.  If we were to free the return value of the function we'd
actually free an internal buffer of the strbuf, that is still valid.
So if someone were to use 'real_path()' after that, the memory that
strbuf still thinks it owns would actually have been free'd, which
would result in undefined behaviour, and probably would make git
segfault.
Matheus Tavares Bernardino April 1, 2019, 3:56 a.m. UTC | #9
On Sat, Mar 30, 2019 at 4:27 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
> On 03/30, Matheus Tavares Bernardino wrote:
> > On Fri, Mar 29, 2019 at 5:05 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > >
> > > On 03/29, Matheus Tavares Bernardino wrote:
> > > > Ok, what if instead of using linkat() we use 'realpath(const char
> > > > *path, char *resolved_path)', which will resolve any symlinks at
> > > > 'path' and store the canonical path at 'resolved_path'? Then, we can
> > > > still keep using link() but now, with the certainty that all platforms
> > > > will have a consistent behaviour? (also, realpath() is POSIX.1-2001)
> > > > Would that be a better idea?
> > >
> > > Yeah, I think that is a good idea.  Note that 'realpath()' itself is
> > > not used anywhere in our codebase either, but there is
> > > 'strbuf_realpath()', that from reading the function documentation does
> > > exactly what 'realpath()' would do.  So using 'strbuf_realpath()'
> > > would probably be the right thing to do here.
> >
> > Thanks. While I was looking for realpath() at git codebase (before I
> > saw your email), I got a little confused: Besides strbuf_realpath() I
> > also found real_path(), real_path_if_valid() and real_pathdup(). All
> > these last three use strbuf_realpath() but they also initialize the
> > struct strbuf internally and just return a 'char *', which is much
> > convenient in some cases.
>
> Right, feel free to use whichever is most convenient for you, and
> whichever works in the context.
>
> >                            What seems weird to me is that, whilst
> > real_pathdup() releases the internally initialized struct strubuf
> > (leaving just the returned string to be free'd by the user), the other
> > two don't. So, if struct strbuf change in the future to have more
> > dynamic allocated resources, these functions will also have to be
> > modified. Also, since real_pathdup() can already do what the other two
> > do, do you know if there is a reason to keep all of them?
>
> Right, '*dup()' functions usually leave the return value to be free'd
> by the caller.  And while 'real_pathdup()' could do what the others do
> already it also takes more effort to use it.  Users don't need to free
> the return value from 'real_path()' to avoid a memory leak.  This
> alone justifies its existence I think.
>
> > One last question: I found some places which don't free the string
> > returned by, for example, real_path() (e.g., find_worktree() at
> > worktree.c). Would it be a valid/good patch (or patches) to add free()
> > calls in this places? (I'm currently trying to get more people here at
> > USP to contribute to git, and maybe this could be a nice first
> > contribution for them...)
>
> Trying to plug memory leaks in the codebase is definitely something
> that I think is worthy of doing.  Sometimes it's not worth actually
> free'ing the memory, for example just before the program exits, in
> which case we can use the UNLEAK annotation.  It was introduced in
> 0e5bba53af ("add UNLEAK annotation for reducing leak false positives",
> 2017-09-08) if you want more background.
>
> That said, the memory from 'real_path()' should actually not be
> free'd.  The strbuf there has a static lifetime, so it is valid until
> git exits.  If we were to free the return value of the function we'd
> actually free an internal buffer of the strbuf, that is still valid.
> So if someone were to use 'real_path()' after that, the memory that
> strbuf still thinks it owns would actually have been free'd, which
> would result in undefined behaviour, and probably would make git
> segfault.
>

Thanks for the great explanation, Thomas. I hadn't noticed that the
strbuf variable inside real_path() is declared as static. I also took
some time, now, to better understand how strbuf functions deal with
the buf attribute (especially how it's realloc'ed) and now I think I
understand it better. Thanks again for the help!

> --
> You received this message because you are subscribed to the Google Groups "Kernel USP" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-usp+unsubscribe@googlegroups.com.
> To post to this group, send email to kernel-usp@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kernel-usp/20190330192738.GQ32487%40hank.intra.tgummerer.com.
> For more options, visit https://groups.google.com/d/optout.

Patch
diff mbox series

diff --git a/builtin/clone.c b/builtin/clone.c
index 50bde99618..b76f33c635 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -443,7 +443,7 @@  static void copy_or_link_directory(struct strbuf *src, struct strbuf *dest,
 		if (unlink(dest->buf) && errno != ENOENT)
 			die_errno(_("failed to unlink '%s'"), dest->buf);
 		if (!option_no_hardlinks) {
-			if (!link(src->buf, dest->buf))
+			if (!linkat(AT_FDCWD, src->buf, AT_FDCWD, dest->buf, AT_SYMLINK_FOLLOW))
 				continue;
 			if (option_local > 0)
 				die_errno(_("failed to create link '%s'"), dest->buf);
diff --git a/t/t5604-clone-reference.sh b/t/t5604-clone-reference.sh
index 708b1a2c66..76d45f1187 100755
--- a/t/t5604-clone-reference.sh
+++ b/t/t5604-clone-reference.sh
@@ -266,7 +266,7 @@  test_expect_success 'clone a repo with garbage in objects/*' '
 	test_cmp expected actual
 '
 
-test_expect_success SYMLINKS 'setup repo with manually symlinked dirs and unknown files at objects/' '
+test_expect_success SYMLINKS 'setup repo with manually symlinked or unknown files at objects/' '
 	git init T &&
 	(
 		cd T &&
@@ -282,10 +282,18 @@  test_expect_success SYMLINKS 'setup repo with manually symlinked dirs and unknow
 			cd .git/objects &&
 			find ?? -type d >loose-dirs &&
 			last_loose=$(tail -n 1 loose-dirs) &&
-			rm -f loose-dirs &&
 			mv $last_loose a-loose-dir &&
 			ln -s a-loose-dir $last_loose &&
+			first_loose=$(head -n 1 loose-dirs) &&
+			rm -f loose-dirs &&
+			(
+				cd $first_loose &&
+				obj=$(ls *) &&
+				mv $obj ../an-object &&
+				ln -s ../an-object $obj
+			) &&
 			find . -type f | sort >../../../T.objects-files.raw &&
+			find . -type l | sort >../../../T.objects-symlinks.raw &&
 			echo unknown_content> unknown_file
 		)
 	) &&
@@ -294,7 +302,7 @@  test_expect_success SYMLINKS 'setup repo with manually symlinked dirs and unknow
 '
 
 
-test_expect_success SYMLINKS 'clone repo with symlinked dirs and unknown files at objects/' '
+test_expect_success SYMLINKS 'clone repo with symlinked or unknown files at objects/' '
 	for option in --local --no-hardlinks --shared --dissociate
 	do
 		git clone $option T T$option || return 1 &&
@@ -303,7 +311,8 @@  test_expect_success SYMLINKS 'clone repo with symlinked dirs and unknown files a
 		test_cmp T.objects T$option.objects &&
 		(
 			cd T$option/.git/objects &&
-			find . -type f | sort >../../../T$option.objects-files.raw
+			find . -type f | sort >../../../T$option.objects-files.raw &&
+			find . -type l | sort >../../../T$option.objects-symlinks.raw
 		)
 	done &&
 
@@ -317,6 +326,7 @@  test_expect_success SYMLINKS 'clone repo with symlinked dirs and unknown files a
 	./Y/Z
 	./Y/Z
 	./a-loose-dir/Z
+	./an-object
 	./Y/Z
 	./info/packs
 	./pack/pack-Z.idx
@@ -326,15 +336,17 @@  test_expect_success SYMLINKS 'clone repo with symlinked dirs and unknown files a
 	./unknown_file
 	EOF
 
-	for option in --local --dissociate --no-hardlinks
+	for option in --local --no-hardlinks --dissociate
 	do
-		test_cmp expected-files T$option.objects-files.raw.de-sha || return 1
+		test_cmp expected-files T$option.objects-files.raw.de-sha || return 1 &&
+		test_must_be_empty T$option.objects-symlinks.raw.de-sha || return 1
 	done &&
 
 	cat >expected-files <<-EOF &&
 	./info/alternates
 	EOF
-	test_cmp expected-files T--shared.objects-files.raw
+	test_cmp expected-files T--shared.objects-files.raw &&
+	test_must_be_empty T--shared.objects-symlinks.raw
 '
 
 test_done