diff mbox series

[v4,4/5] ref: add symref content check for files backend

Message ID ZuRzzwZds8ys-JEN@ArchLinux (mailing list archive)
State New, archived
Headers show
Series [v4,1/5] ref: initialize "fsck_ref_report" with zero | expand

Commit Message

shejialuo Sept. 13, 2024, 5:18 p.m. UTC
We have already introduced the checks for regular refs. There is no need
to check the consistency of the target which the symref points to.
Instead, we just need to check the content of the symref itself.

A regular file is accepted as a textual symref if it begins with
"ref:", followed by zero or more whitespaces, followed by the full
refname, followed only by whitespace characters. We always write
a single SP after "ref:" and a single LF after the refname, but
third-party reimplementations of Git may have taken advantage of the
looser syntax. Put it more specific, we accept the following contents
of the symref:

1. "ref: refs/heads/master   "
2. "ref: refs/heads/master   \n  \n"
3. "ref: refs/heads/master\n\n"

Thus, we could reuse "refMissingNewline" and "trailingRefContent"
FSCK_INFOs to do the same retroactive tightening as we introduce for
regular references.

But we do not allow any other trailing garbage. The followings are bad
symref contents which will be reported as fsck error by "git-fsck(1)".

1. "ref: refs/heads/master garbage\n"
2. "ref: refs/heads/master \n\n\n garbage  "

And we introduce a new "badReferentName(ERROR)" fsck message to report
above errors to the user.

In order to check the content of the symref, create a function
"files_fsck_symref_target". It will first check whether the "referent"
is under the "refs/" directory, if not, we will report "escapeReferent"
fsck error message to notify the user this situation.

Then, we will first check whether the symref content misses the newline
by peeking the last byte of the "referent" to see whether it is '\n'.

And we will remember the untrimmed length of the "referent" and call
"strbuf_rtrim()" on "referent". Then, we will call "check_refname_format"
to check whether the trimmed referent format is valid. If not, we will
report to the user that the symref points to referent which has invalid
format. If it is valid, we will compare the untrimmed length and trimmed
length, if they are not the same, we need to warn the user there is some
trailing garbage in the symref content.

At last, we need to check whether the referent is a directory. We cannot
distinguish whether a given reference like "refs/heads/a" is a file or a
directory by using "check_refname_format". We have already checked bad
file type when iterating the "refs/" directory but we ignore the
directory. Thus, we need to explicitly add check here.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
---
 Documentation/fsck-msgids.txt |   9 +++
 fsck.h                        |   3 +
 refs/files-backend.c          |  81 +++++++++++++++++++++++
 t/t0602-reffiles-fsck.sh      | 117 ++++++++++++++++++++++++++++++++++
 4 files changed, 210 insertions(+)

Comments

Junio C Hamano Sept. 18, 2024, 8:19 p.m. UTC | #1
shejialuo <shejialuo@gmail.com> writes:

Expect that people do not read the body of the message as completing
a paragrpah the title started.  I.e. ...

> We have already introduced the checks for regular refs. There is no need
> to check the consistency of the target which the symref points to.
> Instead, we just need to check the content of the symref itself.

... this needs a bit of preamble, like

    We have code that check regular ref contents, but we do not yet
    check contents of symbolic refs.

> A regular file is accepted as a textual symref if it begins with
> "ref:", followed by zero or more whitespaces, followed by the full
> refname, followed only by whitespace characters. We always write
> a single SP after "ref:" and a single LF after the refname, but
> third-party reimplementations of Git may have taken advantage of the
> looser syntax. Put it more specific, we accept the following contents
> of the symref:
>
> 1. "ref: refs/heads/master   "
> 2. "ref: refs/heads/master   \n  \n"
> 3. "ref: refs/heads/master\n\n"
>
> Thus, we could reuse "refMissingNewline" and "trailingRefContent"
> FSCK_INFOs to do the same retroactive tightening as we introduce for
> regular references.
>
> But we do not allow any other trailing garbage. The followings are bad
> symref contents which will be reported as fsck error by "git-fsck(1)".

This description needs to be updated, as it is unclear if you are
talking about errors we already detect, or if you are planning to
update fsck to notice and report these errors.

> 1. "ref: refs/heads/master garbage\n"
> 2. "ref: refs/heads/master \n\n\n garbage  "
>
> And we introduce a new "badReferentName(ERROR)" fsck message to report
> above errors to the user.

OK.

> In order to check the content of the symref, create a function
> "files_fsck_symref_target". It will first check whether the "referent"
> is under the "refs/" directory, if not, we will report "escapeReferent"
> fsck error message to notify the user this situation.
>
> Then, we will first check whether the symref content misses the newline
> by peeking the last byte of the "referent" to see whether it is '\n'.

"Then, we will first" -> "Then it checks" or something.

You already consumed "first" for the check to limit the referent to
those under "refs/" hierarchy.

> And we will remember the untrimmed length of the "referent" and call
> "strbuf_rtrim()" on "referent". Then, we will call "check_refname_format"
> to check whether the trimmed referent format is valid. If not, we will
> report to the user that the symref points to referent which has invalid
> format. If it is valid, we will compare the untrimmed length and trimmed
> length, if they are not the same, we need to warn the user there is some
> trailing garbage in the symref content.

That is an implementation detail of what you did.  But if the
implementation were buggy and did not exactly what you intended to
do, the above description gives no information to help others to fix
it up so that it works as you intended it to work, because you do
not explain it.

So what did you want to achieve in the third step (the first being
"limit to refs/ hiararchy", the second being "no incomplete lines
allowed")?

    Third, we want to make sure that the contents of a textual
    symref MUST have a single LF after the target refname and
    NOTHING ELSE.

or something.

> At last, we need to check whether the referent is a directory. We cannot

"a directory" -> "an existing directory"?

I am not comfortable to see the word "directory" used in this
proposed log message, as some refs could be stored in the packed
backend and are referenced by the symbolic ref you are inspecting
(this comment also refers to the "refs/ directory" you mentioned
earlier as "the first check").

    Lastly, a symbolic ref MUST either point to an existing ref,
    or if the referent does not exist, it MUST NOT be a leading
    subpath for another existing ref (e.g., when "refs/heads/main"
    exists, a symbolic ref that points at "refs/heads" is a no-no).

or something (but again, I am open to a phrasing better than
"subpath").

Design question.  What do we want to do when we have no loose refs
under the "refs/heads/historical/" hiearchy, (i.e. all of them are
in packed-refs file) hence ".git/refs/heads/historical" directory
does not exist on the filesystem.  And a symbolic ref points at
"refs/heads/historical".  Shouldn't we give the same error whether
the .git/refs/heads/historical directory exist or not, as long as
the refs/heads/historical/main branch exists (in the packed-refs
backend)?

> diff --git a/Documentation/fsck-msgids.txt b/Documentation/fsck-msgids.txt
> index 8827137ef0..03bcb77972 100644
> --- a/Documentation/fsck-msgids.txt
> +++ b/Documentation/fsck-msgids.txt
> @@ -28,6 +28,12 @@
>  `badRefName`::
>  	(ERROR) A ref has an invalid format.
>  
> +`badReferentFiletype`::
> +	(ERROR) The referent of a symref has a bad file type.
> +
> +`badReferentName`::
> +	(ERROR) The referent name of a symref is invalid.
> +
>  `badTagName`::
>  	(INFO) A tag has an invalid format.
>  
> @@ -49,6 +55,9 @@
>  `emptyName`::
>  	(WARN) A path contains an empty name.
>  
> +`escapeReferent`::
> +	(ERROR) The referent of a symref is outside the "ref" directory.

I am not sure starting this as ERROR is wise.  Users and third-party
tools make creative uses of the system and I cannot offhand think of
an argument why it should be forbidden to create a symbolic link to
our own HEAD or to some worktree-specific ref in another worktree.

> +	size_t len = referent->len - 1;
> +	struct stat st;
> +	int ret = 0;
> +
> +	if (!starts_with(referent->buf, "refs/")) {
> +		ret = fsck_report_ref(o, report,
> +				      FSCK_MSG_ESCAPE_REFERENT,
> +				      "points to ref outside the refs directory");
> +		goto out;
> +	}
> +
> +	if (referent->buf[referent->len - 1] != '\n') {

As you initialized "len" to "referent->len-1" earlier, wouldn't it
more natural to use it here?  That would match the incrementing of
len++ later in this block.

> +		ret = fsck_report_ref(o, report,
> +				      FSCK_MSG_REF_MISSING_NEWLINE,
> +				      "missing newline");
> +		len++;
> +	}

Having said that, the above should be simplified more like:

 * declare but not initialize "len".  better yet, declare "orig_len"
   and leave it uninitialized.

 * do not touch "len++" in the above block (actually, you can
   discard the above "if(it does not end with LF)" block, see
   below).

 * instead grab "referent->len" in "len" (or "orig_len") immediately
   before you first modify referent, i.e. before strbuf_rtrim() call.

	orig_len = referent->len;
	orig_last_byte = referent->buf[orig_len - 1];

> +	strbuf_rtrim(referent);
> +	if (check_refname_format(referent->buf, 0)) {
> +		ret = fsck_report_ref(o, report,
> +				      FSCK_MSG_BAD_REFERENT_NAME,
> +				      "points to refname with invalid format");

Similar to an earlier step, the message does not give any more
information than the enum.  Wouldn't the user who got this error
want to learn what referent->buf said and which part of it was bad
in the same message, instead of having to look it up on their own
after fsck finishes?

> +		goto out;
> +	}

At this point we know check_refname_format() is happy with what is
left after rtrimming the referent.  There are four cases:

 - rtrim() did not trim anything (orig_len == referent->len); the file
   lacked the terminating LF.

 - rtrim() trimmed one byte (orig_len - 1 == referent->len) and
   the byte was not LF (orig_last_byte != '\n').  The file lacked
   the terminating LF.

 - rtrim() trimmed exactly one byte (orig_len - 1 == referent->len)
   and the byte was LF (orig_last_byte == '\n').  There is no error.

 - all other cases, i.e., rtrim() trimmed two or more bytes.  The
   file had trailing whitespaces after a valid referent that passed
   check_refname_format().

So in short,

	if (referent->len == orig_len ||
	    referent->len == orig_len - 1 && orig_last_byte != '\n') {
		FSCK_MSG_REF_MISSING_NEWLINE;
	} else if (referent->len < orig_len - 1) {
		FSCK_MSG_REF_TRAILING_WHITESPACE;
	}

can replace the next block you wrote, and we can also remove the
earlier "it is an error if it does not end with '\n'", I think.

> +	if (len != referent->len) {
> +		ret = fsck_report_ref(o, report,
> +				      FSCK_MSG_TRAILING_REF_CONTENT,
> +				      "trailing garbage in ref");

As check_refname_format() was happy, the difference between orig_len
and referent->len are only coming from trailing whitespaces, i.e. it
is not that it had arbitrary garbage.  Shouldn't we be more explicit
about that?

> +	/*
> +	 * Dangling symrefs are common and so we don't report them.
> +	 */
> +	if (lstat(referent_path->buf, &st)) {
> +		if (errno != ENOENT) {
> +			ret = error_errno(_("unable to stat '%s'"),
> +					  referent_path->buf);
> +		}
> +		goto out;
> +	}
> +
> +	/*
> +	 * We cannot distinguish whether "refs/heads/a" is a directory or not by
> +	 * using "check_refname_format(referent->buf, 0)". Instead, we need to
> +	 * check the file type of the target.
> +	 */
> +	if (S_ISDIR(st.st_mode)) {
> +		ret = fsck_report_ref(o, report,
> +				      FSCK_MSG_BAD_REFERENT_FILETYPE,
> +				      "points to the directory");
> +		goto out;
> +	}

If referent_path->buf refers to "refs/heads/historical/", and all
the branches under the hierarchy have been sent to packed-refs,
then this check will not trigger.

I wonder if this check is the right thing to enforce in the first
place, though.

As far as the end user is concerned, refs/heads/historical/master
branch stil exists, and there is no refs/heads/historical branch, so
such a symbolic ref, for all intents and purposes, is the same as
any other dangling symbolic refs, no?

Of course, "git update-ref SUCH_A_SYMREF HEAD" will complain because
there is refs/heads/historical, with something like 

    "refs/heads/historical/master" exists, cannot create "refs/heads/historical"

but that is to be expected.  If you remove the last branch in the
refs/heads/historical hierarchy, you should be able to do such an
update-ref to instanciate refs/heads/historical as a regular ref.

> @@ -3484,12 +3553,24 @@ static int files_fsck_refs_content(struct ref_store *ref_store,
>  					      "trailing garbage in ref");
>  			goto cleanup;
>  		}
> +	} else {
> +		strbuf_addf(&referent_path, "%s/%s",
> +			    ref_store->gitdir, referent.buf);
> +		/*
> +		 * the referent may contain the spaces and the newline, need to
> +		 * trim for path.
> +		 */
> +		strbuf_rtrim(&referent_path);

I doubt this is a good design.  We have referent, and the symbolic
ref checker knows that the true referent refname may be followed by
whitespaces, so instead of inventing referent _path here, it would
be a better design to let the files_fsck_symref_target() to decide
what file to open and check based on referent, no?  Give it the
refstore or refstore's gitdir and have the concatenation with the
rtrimmed contents in the referent->buf after it inspected it
instead, perhaps?

> +		ret = files_fsck_symref_target(o, &report,
> +					       &referent,
> +					       &referent_path);
shejialuo Sept. 22, 2024, 3:53 p.m. UTC | #2
On Wed, Sep 18, 2024 at 01:19:13PM -0700, Junio C Hamano wrote:
> shejialuo <shejialuo@gmail.com> writes:
> 
> Expect that people do not read the body of the message as completing
> a paragrpah the title started.  I.e. ...
> 
> > We have already introduced the checks for regular refs. There is no need
> > to check the consistency of the target which the symref points to.
> > Instead, we just need to check the content of the symref itself.
> 
> ... this needs a bit of preamble, like
> 
>     We have code that check regular ref contents, but we do not yet
>     check contents of symbolic refs.
> 

Thanks, I will improve this in the next version.

> > A regular file is accepted as a textual symref if it begins with
> > "ref:", followed by zero or more whitespaces, followed by the full
> > refname, followed only by whitespace characters. We always write
> > a single SP after "ref:" and a single LF after the refname, but
> > third-party reimplementations of Git may have taken advantage of the
> > looser syntax. Put it more specific, we accept the following contents
> > of the symref:
> >
> > 1. "ref: refs/heads/master   "
> > 2. "ref: refs/heads/master   \n  \n"
> > 3. "ref: refs/heads/master\n\n"
> >
> > Thus, we could reuse "refMissingNewline" and "trailingRefContent"
> > FSCK_INFOs to do the same retroactive tightening as we introduce for
> > regular references.
> >
> > But we do not allow any other trailing garbage. The followings are bad
> > symref contents which will be reported as fsck error by "git-fsck(1)".
> 
> This description needs to be updated, as it is unclear if you are
> talking about errors we already detect, or if you are planning to
> update fsck to notice and report these errors.
> 

Yes, When I was writing this part, I felt a little painful to express my
words. I have thought how could I express the connection between the
current patch and the previous one.

> > And we will remember the untrimmed length of the "referent" and call
> > "strbuf_rtrim()" on "referent". Then, we will call "check_refname_format"
> > to check whether the trimmed referent format is valid. If not, we will
> > report to the user that the symref points to referent which has invalid
> > format. If it is valid, we will compare the untrimmed length and trimmed
> > length, if they are not the same, we need to warn the user there is some
> > trailing garbage in the symref content.
> 
> That is an implementation detail of what you did.  But if the
> implementation were buggy and did not exactly what you intended to
> do, the above description gives no information to help others to fix
> it up so that it works as you intended it to work, because you do
> not explain it.
> 
> So what did you want to achieve in the third step (the first being
> "limit to refs/ hiararchy", the second being "no incomplete lines
> allowed")?
> 
>     Third, we want to make sure that the contents of a textual
>     symref MUST have a single LF after the target refname and
>     NOTHING ELSE.
> 
> or something.
> 

From the above comments, I need to organize the commit message of
this patch to make things clear here.

> "a directory" -> "an existing directory"?
> 
> I am not comfortable to see the word "directory" used in this
> proposed log message, as some refs could be stored in the packed
> backend and are referenced by the symbolic ref you are inspecting
> (this comment also refers to the "refs/ directory" you mentioned
> earlier as "the first check").
> 
>     Lastly, a symbolic ref MUST either point to an existing ref,
>     or if the referent does not exist, it MUST NOT be a leading
>     subpath for another existing ref (e.g., when "refs/heads/main"
>     exists, a symbolic ref that points at "refs/heads" is a no-no).
> 
> or something (but again, I am open to a phrasing better than
> "subpath").
> 
> Design question.  What do we want to do when we have no loose refs
> under the "refs/heads/historical/" hiearchy, (i.e. all of them are
> in packed-refs file) hence ".git/refs/heads/historical" directory
> does not exist on the filesystem.  And a symbolic ref points at
> "refs/heads/historical".  Shouldn't we give the same error whether
> the .git/refs/heads/historical directory exist or not, as long as
> the refs/heads/historical/main branch exists (in the packed-refs
> backend)?
> 

I guess I need to think carefully here. Actually, my intention is that I
want to concentrate on the loose refs and then take consideration about
the packed refs.

However, from what you have said above, it seems I could not do this.
They are connected. But at current, I am not so familiar with packed
refs behavior, I could not answer all the questions above.

I decide to understand what packed-ref done. So, this series may be
stalled sometime until I have a good knowledge and re-think the design
here.

> > +`escapeReferent`::
> > +	(ERROR) The referent of a symref is outside the "ref" directory.
> 
> I am not sure starting this as ERROR is wise.  Users and third-party
> tools make creative uses of the system and I cannot offhand think of
> an argument why it should be forbidden to create a symbolic link to
> our own HEAD or to some worktree-specific ref in another worktree.
> 

Do we allow this cross-access (hack)? It might cause some trouble from
my perspective.

> > +	if (referent->buf[referent->len - 1] != '\n') {
> 
> As you initialized "len" to "referent->len-1" earlier, wouldn't it
> more natural to use it here?  That would match the incrementing of
> len++ later in this block.
> 

Yes, exactly.

> > +		ret = fsck_report_ref(o, report,
> > +				      FSCK_MSG_REF_MISSING_NEWLINE,
> > +				      "missing newline");
> > +		len++;
> > +	}
> 
> Having said that, the above should be simplified more like:
> 
>  * declare but not initialize "len".  better yet, declare "orig_len"
>    and leave it uninitialized.
> 
>  * do not touch "len++" in the above block (actually, you can
>    discard the above "if(it does not end with LF)" block, see
>    below).
> 
>  * instead grab "referent->len" in "len" (or "orig_len") immediately
>    before you first modify referent, i.e. before strbuf_rtrim() call.
> 
> 	orig_len = referent->len;
> 	orig_last_byte = referent->buf[orig_len - 1];
> 

I agree.

> > +	strbuf_rtrim(referent);
> > +	if (check_refname_format(referent->buf, 0)) {
> > +		ret = fsck_report_ref(o, report,
> > +				      FSCK_MSG_BAD_REFERENT_NAME,
> > +				      "points to refname with invalid format");
> 
> Similar to an earlier step, the message does not give any more
> information than the enum.  Wouldn't the user who got this error
> want to learn what referent->buf said and which part of it was bad
> in the same message, instead of having to look it up on their own
> after fsck finishes?
> 

Yes, I agree. I will improve this.

> > +		goto out;
> > +	}
> 
> At this point we know check_refname_format() is happy with what is
> left after rtrimming the referent.  There are four cases:
> 
>  - rtrim() did not trim anything (orig_len == referent->len); the file
>    lacked the terminating LF.
> 
>  - rtrim() trimmed one byte (orig_len - 1 == referent->len) and
>    the byte was not LF (orig_last_byte != '\n').  The file lacked
>    the terminating LF.
> 
>  - rtrim() trimmed exactly one byte (orig_len - 1 == referent->len)
>    and the byte was LF (orig_last_byte == '\n').  There is no error.
> 
>  - all other cases, i.e., rtrim() trimmed two or more bytes.  The
>    file had trailing whitespaces after a valid referent that passed
>    check_refname_format().
> 

That's so clear. My implementation is not good compared with this.

> So in short,
> 
> 	if (referent->len == orig_len ||
> 	    referent->len == orig_len - 1 && orig_last_byte != '\n') {
> 		FSCK_MSG_REF_MISSING_NEWLINE;
> 	} else if (referent->len < orig_len - 1) {
> 		FSCK_MSG_REF_TRAILING_WHITESPACE;
> 	}
> 
> can replace the next block you wrote, and we can also remove the
> earlier "it is an error if it does not end with '\n'", I think.
> 
> > +	if (len != referent->len) {
> > +		ret = fsck_report_ref(o, report,
> > +				      FSCK_MSG_TRAILING_REF_CONTENT,
> > +				      "trailing garbage in ref");
> 
> As check_refname_format() was happy, the difference between orig_len
> and referent->len are only coming from trailing whitespaces, i.e. it
> is not that it had arbitrary garbage.  Shouldn't we be more explicit
> about that?
> 

Yes, I made a lot of mistakes when calling the "fsck_report_ref". I will
report the exact garbage content to the user.

> > +	/*
> > +	 * Dangling symrefs are common and so we don't report them.
> > +	 */
> > +	if (lstat(referent_path->buf, &st)) {
> > +		if (errno != ENOENT) {
> > +			ret = error_errno(_("unable to stat '%s'"),
> > +					  referent_path->buf);
> > +		}
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * We cannot distinguish whether "refs/heads/a" is a directory or not by
> > +	 * using "check_refname_format(referent->buf, 0)". Instead, we need to
> > +	 * check the file type of the target.
> > +	 */
> > +	if (S_ISDIR(st.st_mode)) {
> > +		ret = fsck_report_ref(o, report,
> > +				      FSCK_MSG_BAD_REFERENT_FILETYPE,
> > +				      "points to the directory");
> > +		goto out;
> > +	}
> 
> If referent_path->buf refers to "refs/heads/historical/", and all
> the branches under the hierarchy have been sent to packed-refs,
> then this check will not trigger.
> 

Yes, because "refs/heads/historical" will not appear in the filesystem.

> I wonder if this check is the right thing to enforce in the first
> place, though.
> 
> As far as the end user is concerned, refs/heads/historical/master
> branch stil exists, and there is no refs/heads/historical branch, so
> such a symbolic ref, for all intents and purposes, is the same as
> any other dangling symbolic refs, no?
> 
> Of course, "git update-ref SUCH_A_SYMREF HEAD" will complain because
> there is refs/heads/historical, with something like 
> 
>     "refs/heads/historical/master" exists, cannot create "refs/heads/historical"
> 
> but that is to be expected.  If you remove the last branch in the
> refs/heads/historical hierarchy, you should be able to do such an
> update-ref to instanciate refs/heads/historical as a regular ref.
> 

I am a little shocked here. I do this in action and find the directory
will be automatically converted to a regular file in the filesystem. So,
I agree with you here. We should never check this, because we allow
symref to point to a directory. As long as there is no loose refs and
packed refs under this directory, we could use "git update-ref" for this
symref.

Thanks,

> > @@ -3484,12 +3553,24 @@ static int files_fsck_refs_content(struct ref_store *ref_store,
> >  					      "trailing garbage in ref");
> >  			goto cleanup;
> >  		}
> > +	} else {
> > +		strbuf_addf(&referent_path, "%s/%s",
> > +			    ref_store->gitdir, referent.buf);
> > +		/*
> > +		 * the referent may contain the spaces and the newline, need to
> > +		 * trim for path.
> > +		 */
> > +		strbuf_rtrim(&referent_path);
> 
> I doubt this is a good design.  We have referent, and the symbolic
> ref checker knows that the true referent refname may be followed by
> whitespaces, so instead of inventing referent _path here, it would
> be a better design to let the files_fsck_symref_target() to decide
> what file to open and check based on referent, no?  Give it the
> refstore or refstore's gitdir and have the concatenation with the
> rtrimmed contents in the referent->buf after it inspected it
> instead, perhaps?
> 

Yes, I agree with you here. We should use "files_fsck_symref_target" to
do this.


----

From this review, I think I need to understand more behaviors about
files backend and packed backend. Thanks for your so dedicated reviews.
I may spend more time to send the next version. And there may be some
delay.

Thanks,
Jialuo
Junio C Hamano Sept. 22, 2024, 4:55 p.m. UTC | #3
shejialuo <shejialuo@gmail.com> writes:

>> > +`escapeReferent`::
>> > +	(ERROR) The referent of a symref is outside the "ref" directory.
>> 
>> I am not sure starting this as ERROR is wise.  Users and third-party
>> tools make creative uses of the system and I cannot offhand think of
>> an argument why it should be forbidden to create a symbolic link to
>> our own HEAD or to some worktree-specific ref in another worktree.
>> 
> Do we allow this cross-access (hack)? It might cause some trouble from
> my perspective.

If the current implementation allows users to set up and take
advantage of, then it is not a hack.  It would cause breakage
if we make it an error.  Does such a symref successfully refer
to the referent right now?  I think it does.

Thanks.
diff mbox series

Patch

diff --git a/Documentation/fsck-msgids.txt b/Documentation/fsck-msgids.txt
index 8827137ef0..03bcb77972 100644
--- a/Documentation/fsck-msgids.txt
+++ b/Documentation/fsck-msgids.txt
@@ -28,6 +28,12 @@ 
 `badRefName`::
 	(ERROR) A ref has an invalid format.
 
+`badReferentFiletype`::
+	(ERROR) The referent of a symref has a bad file type.
+
+`badReferentName`::
+	(ERROR) The referent name of a symref is invalid.
+
 `badTagName`::
 	(INFO) A tag has an invalid format.
 
@@ -49,6 +55,9 @@ 
 `emptyName`::
 	(WARN) A path contains an empty name.
 
+`escapeReferent`::
+	(ERROR) The referent of a symref is outside the "ref" directory.
+
 `extraHeaderEntry`::
 	(IGNORE) Extra headers found after `tagger`.
 
diff --git a/fsck.h b/fsck.h
index b85072df57..c90561c6db 100644
--- a/fsck.h
+++ b/fsck.h
@@ -34,11 +34,14 @@  enum fsck_msg_type {
 	FUNC(BAD_REF_CONTENT, ERROR) \
 	FUNC(BAD_REF_FILETYPE, ERROR) \
 	FUNC(BAD_REF_NAME, ERROR) \
+	FUNC(BAD_REFERENT_FILETYPE, ERROR) \
+	FUNC(BAD_REFERENT_NAME, ERROR) \
 	FUNC(BAD_TIMEZONE, ERROR) \
 	FUNC(BAD_TREE, ERROR) \
 	FUNC(BAD_TREE_SHA1, ERROR) \
 	FUNC(BAD_TYPE, ERROR) \
 	FUNC(DUPLICATE_ENTRIES, ERROR) \
+	FUNC(ESCAPE_REFERENT, ERROR) \
 	FUNC(MISSING_AUTHOR, ERROR) \
 	FUNC(MISSING_COMMITTER, ERROR) \
 	FUNC(MISSING_EMAIL, ERROR) \
diff --git a/refs/files-backend.c b/refs/files-backend.c
index df4ce270ae..0cb4a2da73 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -3434,11 +3434,80 @@  typedef int (*files_fsck_refs_fn)(struct ref_store *ref_store,
 				  const char *refs_check_dir,
 				  struct dir_iterator *iter);
 
+/*
+ * Check the symref "referent" and "referent_path". For textual symref,
+ * "referent" would be the content after "refs:".
+ */
+static int files_fsck_symref_target(struct fsck_options *o,
+				    struct fsck_ref_report *report,
+				    struct strbuf *referent,
+				    struct strbuf *referent_path)
+{
+	size_t len = referent->len - 1;
+	struct stat st;
+	int ret = 0;
+
+	if (!starts_with(referent->buf, "refs/")) {
+		ret = fsck_report_ref(o, report,
+				      FSCK_MSG_ESCAPE_REFERENT,
+				      "points to ref outside the refs directory");
+		goto out;
+	}
+
+	if (referent->buf[referent->len - 1] != '\n') {
+		ret = fsck_report_ref(o, report,
+				      FSCK_MSG_REF_MISSING_NEWLINE,
+				      "missing newline");
+		len++;
+	}
+
+	strbuf_rtrim(referent);
+	if (check_refname_format(referent->buf, 0)) {
+		ret = fsck_report_ref(o, report,
+				      FSCK_MSG_BAD_REFERENT_NAME,
+				      "points to refname with invalid format");
+		goto out;
+	}
+
+	if (len != referent->len) {
+		ret = fsck_report_ref(o, report,
+				      FSCK_MSG_TRAILING_REF_CONTENT,
+				      "trailing garbage in ref");
+	}
+
+	/*
+	 * Dangling symrefs are common and so we don't report them.
+	 */
+	if (lstat(referent_path->buf, &st)) {
+		if (errno != ENOENT) {
+			ret = error_errno(_("unable to stat '%s'"),
+					  referent_path->buf);
+		}
+		goto out;
+	}
+
+	/*
+	 * We cannot distinguish whether "refs/heads/a" is a directory or not by
+	 * using "check_refname_format(referent->buf, 0)". Instead, we need to
+	 * check the file type of the target.
+	 */
+	if (S_ISDIR(st.st_mode)) {
+		ret = fsck_report_ref(o, report,
+				      FSCK_MSG_BAD_REFERENT_FILETYPE,
+				      "points to the directory");
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
 static int files_fsck_refs_content(struct ref_store *ref_store,
 				   struct fsck_options *o,
 				   const char *refs_check_dir,
 				   struct dir_iterator *iter)
 {
+	struct strbuf referent_path = STRBUF_INIT;
 	struct strbuf ref_content = STRBUF_INIT;
 	struct strbuf referent = STRBUF_INIT;
 	struct strbuf refname = STRBUF_INIT;
@@ -3484,12 +3553,24 @@  static int files_fsck_refs_content(struct ref_store *ref_store,
 					      "trailing garbage in ref");
 			goto cleanup;
 		}
+	} else {
+		strbuf_addf(&referent_path, "%s/%s",
+			    ref_store->gitdir, referent.buf);
+		/*
+		 * the referent may contain the spaces and the newline, need to
+		 * trim for path.
+		 */
+		strbuf_rtrim(&referent_path);
+		ret = files_fsck_symref_target(o, &report,
+					       &referent,
+					       &referent_path);
 	}
 
 cleanup:
 	strbuf_release(&refname);
 	strbuf_release(&ref_content);
 	strbuf_release(&referent);
+	strbuf_release(&referent_path);
 	return ret;
 }
 
diff --git a/t/t0602-reffiles-fsck.sh b/t/t0602-reffiles-fsck.sh
index a06ad044f2..9580c340ab 100755
--- a/t/t0602-reffiles-fsck.sh
+++ b/t/t0602-reffiles-fsck.sh
@@ -209,4 +209,121 @@  test_expect_success 'regular ref content should be checked (aggregate)' '
 	test_cmp expect sorted_err
 '
 
+test_expect_success 'textual symref content should be checked (individual)' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	branch_dir_prefix=.git/refs/heads &&
+	tag_dir_prefix=.git/refs/tags &&
+	cd repo &&
+	test_commit default &&
+	mkdir -p "$branch_dir_prefix/a/b" &&
+
+	printf "ref: refs/heads/branch\n" >$branch_dir_prefix/branch-good &&
+	git refs verify 2>err &&
+	rm $branch_dir_prefix/branch-good &&
+	test_must_be_empty err &&
+
+	printf "ref: refs/heads/branch" >$branch_dir_prefix/branch-no-newline-1 &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/heads/branch-no-newline-1: refMissingNewline: missing newline
+	EOF
+	rm $branch_dir_prefix/branch-no-newline-1 &&
+	test_cmp expect err &&
+
+	printf "ref: refs/heads/branch     " >$branch_dir_prefix/a/b/branch-trailing-1 &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/heads/a/b/branch-trailing-1: refMissingNewline: missing newline
+	warning: refs/heads/a/b/branch-trailing-1: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $branch_dir_prefix/a/b/branch-trailing-1 &&
+	test_cmp expect err &&
+
+	printf "ref: refs/heads/branch\n\n" >$branch_dir_prefix/a/b/branch-trailing-2 &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/heads/a/b/branch-trailing-2: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $branch_dir_prefix/a/b/branch-trailing-2 &&
+	test_cmp expect err &&
+
+	printf "ref: refs/heads/branch \n" >$branch_dir_prefix/a/b/branch-trailing-3 &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/heads/a/b/branch-trailing-3: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $branch_dir_prefix/a/b/branch-trailing-3 &&
+	test_cmp expect err &&
+
+	printf "ref: refs/heads/branch \n  " >$branch_dir_prefix/a/b/branch-complicated &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/heads/a/b/branch-complicated: refMissingNewline: missing newline
+	warning: refs/heads/a/b/branch-complicated: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $branch_dir_prefix/a/b/branch-complicated &&
+	test_cmp expect err &&
+
+	printf "ref: refs/heads/.branch\n" >$branch_dir_prefix/branch-bad-1 &&
+	test_must_fail git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	error: refs/heads/branch-bad-1: badReferentName: points to refname with invalid format
+	EOF
+	rm $branch_dir_prefix/branch-bad-1 &&
+	test_cmp expect err &&
+
+	printf "ref: reflogs/heads/main\n" >$branch_dir_prefix/branch-bad-2 &&
+	test_must_fail git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	error: refs/heads/branch-bad-2: escapeReferent: points to ref outside the refs directory
+	EOF
+	rm $branch_dir_prefix/branch-bad-2 &&
+	test_cmp expect err &&
+
+	printf "ref: refs/heads/a\n" >$branch_dir_prefix/branch-bad-3 &&
+	test_must_fail git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	error: refs/heads/branch-bad-3: badReferentFiletype: points to the directory
+	EOF
+	rm $branch_dir_prefix/branch-bad-3 &&
+	test_cmp expect err
+'
+
+test_expect_success 'textual symref content should be checked (aggregate)' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	branch_dir_prefix=.git/refs/heads &&
+	tag_dir_prefix=.git/refs/tags &&
+	cd repo &&
+	test_commit default &&
+	mkdir -p "$branch_dir_prefix/a/b" &&
+
+	printf "ref: refs/heads/branch\n" >$branch_dir_prefix/branch-good &&
+	printf "ref: refs/heads/branch" >$branch_dir_prefix/branch-no-newline-1 &&
+	printf "ref: refs/heads/branch     " >$branch_dir_prefix/a/b/branch-trailing-1 &&
+	printf "ref: refs/heads/branch\n\n" >$branch_dir_prefix/a/b/branch-trailing-2 &&
+	printf "ref: refs/heads/branch \n" >$branch_dir_prefix/a/b/branch-trailing-3 &&
+	printf "ref: refs/heads/branch \n  " >$branch_dir_prefix/a/b/branch-complicated &&
+	printf "ref: refs/heads/.branch\n" >$branch_dir_prefix/branch-bad-1 &&
+	printf "ref: reflogs/heads/main\n" >$branch_dir_prefix/branch-bad-2 &&
+	printf "ref: refs/heads/a\n" >$branch_dir_prefix/branch-bad-3 &&
+
+	test_must_fail git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	error: refs/heads/branch-bad-1: badReferentName: points to refname with invalid format
+	error: refs/heads/branch-bad-2: escapeReferent: points to ref outside the refs directory
+	error: refs/heads/branch-bad-3: badReferentFiletype: points to the directory
+	warning: refs/heads/a/b/branch-complicated: refMissingNewline: missing newline
+	warning: refs/heads/a/b/branch-complicated: trailingRefContent: trailing garbage in ref
+	warning: refs/heads/a/b/branch-trailing-1: refMissingNewline: missing newline
+	warning: refs/heads/a/b/branch-trailing-1: trailingRefContent: trailing garbage in ref
+	warning: refs/heads/a/b/branch-trailing-2: trailingRefContent: trailing garbage in ref
+	warning: refs/heads/a/b/branch-trailing-3: trailingRefContent: trailing garbage in ref
+	warning: refs/heads/branch-no-newline-1: refMissingNewline: missing newline
+	EOF
+	sort err >sorted_err &&
+	test_cmp expect sorted_err
+'
+
 test_done