[1/2] sequencer: truncate labels to accommodate loose refs

Message ID 4971e3c52504bf965aa754c9a5d31abddbcc1466.1691685300.git.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 7481d2bfca7fd36f63fd437508be2bca338c9477
Series sequencer: truncate lockfile and ref to NAME_MAX

Commit Message

Mark Ruvald Pedersen Aug. 10, 2023, 4:34 p.m. UTC
From: Mark Ruvald Pedersen <mped@demant.com>

Some commits may have unusually long subject lines. When those subject
lines are used as labels in the `--rebase-merges` mode of `git rebase`,
they can cause errors when writing the corresponding loose refs because
most file systems have a maximum file name length of 255 bytes (`NAME_MAX`).
The symptom looks like this:

	$ git rebase --continue
	error: cannot lock ref 'refs/rewritten/SANITIZED-SUBJECT': Unable to create '.git/refs/rewritten/SANITIZED-SUBJECT.lock': File name too long

where SANITIZED-SUBJECT is very long.

Let's accommodate this situation by truncating the labels.

Care must be taken in case the subject line contains multi-byte
characters so as not to truncate in the middle of a character.

Signed-off-by: Mark Ruvald Pedersen <mped@demant.com>
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---
 git-compat-util.h |  4 ++++
 sequencer.c       | 41 ++++++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)

Comments

Junio C Hamano Aug. 10, 2023, 5:12 p.m. UTC | #1
"Mark Ruvald Pedersen via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +/*
> + * To accommodate common filesystem limitations, where the loose refs' file
> + * names must not exceed `NAME_MAX`, the labels generated by `git rebase
> + * --rebase-merges` need to be truncated if the corresponding commit subjects
> + * are too long.
> + * Add some margin to stay clear from reaching `NAME_MAX`.
> + */
> +#define GIT_MAX_LABEL_LENGTH ((NAME_MAX) - (LOCK_SUFFIX_LEN) - 16)

OK.  Hopefully no systems define NAME_MAX shorter than 20 bytes ;-).

We may suffix "-%d" to make it unique after this truncation, so
there definitely is a need for some slop, and 16 bytes should
be sufficiently long.
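
To make the arithmetic behind the 16-byte margin concrete, here is a minimal sketch. It assumes `LOCK_SUFFIX_LEN` is 5 (the length of ".lock") and that the uniqueness suffix is formatted with "-%d" from an `int`; `longest_suffix_len()` is a hypothetical helper, not something in the patch.

```c
#include <limits.h>
#include <stdio.h>

#ifndef NAME_MAX
#define NAME_MAX 255	/* same fallback the patch adds to git-compat-util.h */
#endif

/* Assumption: the lock suffix is ".lock", i.e. 5 bytes. */
#define LOCK_SUFFIX_LEN 5
#define GIT_MAX_LABEL_LENGTH ((NAME_MAX) - (LOCK_SUFFIX_LEN) - 16)

/* Hypothetical helper: bytes needed for the longest possible "-%d" suffix. */
static int longest_suffix_len(void)
{
	/* snprintf() with a zero-sized buffer only reports the length. */
	return snprintf(NULL, 0, "-%d", INT_MAX);
}
```

With `NAME_MAX` at 255 the label budget is 255 - 5 - 16 = 234 bytes. On a platform with a 32-bit `int`, the longest suffix is "-2147483647" (11 bytes), so a truncated label plus suffix plus ".lock" still fits within `NAME_MAX`.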

> @@ -5404,14 +5415,34 @@ static const char *label_oid(struct object_id *oid, const char *label,
>  		 *
>  		 * Note that we retain non-ASCII UTF-8 characters (identified
>  		 * via the most significant bit). They should be all acceptable
> -		 * in file names. We do not validate the UTF-8 here, that's not
> -		 * the job of this function.
> +		 * in file names.
> +		 *
> +		 * As we will use the labels as names of (loose) refs, it is
> +		 * vital that the name not be longer than the maximum component
> +		 * size of the file system (`NAME_MAX`). We are careful to
> +		 * truncate the label accordingly, allowing for the `.lock`
> +		 * suffix and for the label to be UTF-8 encoded (i.e. we avoid
> +		 * truncating in the middle of a character).
>  		 */
> -		for (; *label; label++)
> -			if ((*label & 0x80) || isalnum(*label))
> +		for (; *label && buf->len + 1 < max_len; label++)
> +			if (isalnum(*label) ||
> +			    (!label_is_utf8 && (*label & 0x80)))
>  				strbuf_addch(buf, *label);
> +			else if (*label & 0x80) {
> +				const char *p = label;
> +
> +				utf8_width(&p, NULL);
> +				if (p) {
> +					if (buf->len + (p - label) > max_len)
> +						break;
> +					strbuf_add(buf, label, p - label);
> +					label = p - 1;
> +				} else {
> +					label_is_utf8 = 0;
> +					strbuf_addch(buf, *label);
> +				}

utf8_width() does let you advance one Unicode character at a time as
its side effect, but it may be a bit overkill, as its primary
function is to compute the display width of that character.

We could take advantage of the fact that the first byte of a UTF-8
character has two high-bits set (i.e. 11xxxxxx) while the second and
subsequent bytes have only the top-bit set and the second highest
bit clear (i.e. 10xxxxxx) to simplify/optimize it.  If this were in
a performance sensitive codepath, that is.
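
A minimal sketch of that bit-pattern approach, for illustration only: `is_utf8_continuation()` and `utf8_cut_point()` are hypothetical names, and the code assumes the input is already well-formed UTF-8 (it does not validate anything).

```c
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx. */
static int is_utf8_continuation(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}

/*
 * Hypothetical helper: return the largest length <= max_len at which
 * the first `len` bytes of `s` can be cut without splitting a character.
 * Assumes `s` is well-formed UTF-8.
 */
static size_t utf8_cut_point(const char *s, size_t len, size_t max_len)
{
	size_t cut = len < max_len ? len : max_len;

	/* Back up while the byte right after the cut continues a character. */
	while (cut > 0 && cut < len &&
	       is_utf8_continuation((unsigned char)s[cut]))
		cut--;
	return cut;
}
```

For the three-byte string "a\xc3\xa9" and a limit of 2, the byte at offset 2 is a continuation byte, so the cut moves back to offset 1 and only "a" is kept.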

I'll queue it as-is for now, as we are in "regression fix only"
phase of the cycle, and have enough time to polish it.

Thanks.

>  			/* avoid leading dash and double-dashes */
> -			else if (buf->len && buf->buf[buf->len - 1] != '-')
> +			} else if (buf->len && buf->buf[buf->len - 1] != '-')
>  				strbuf_addch(buf, '-');
>  		if (!buf->len) {
>  			strbuf_addstr(buf, "rev-");

Johannes Schindelin Aug. 16, 2023, 8:36 a.m. UTC | #2
Hi,

On Thu, 10 Aug 2023, Junio C Hamano wrote:

> "Mark Ruvald Pedersen via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> > +/*
> > + * To accommodate common filesystem limitations, where the loose refs' file
> > + * names must not exceed `NAME_MAX`, the labels generated by `git rebase
> > + * --rebase-merges` need to be truncated if the corresponding commit subjects
> > + * are too long.
> > + * Add some margin to stay clear from reaching `NAME_MAX`.
> > + */
> > +#define GIT_MAX_LABEL_LENGTH ((NAME_MAX) - (LOCK_SUFFIX_LEN) - 16)
>
> OK.  Hopefully no systems define NAME_MAX shorter than 20 bytes ;-).

If there are, we already have problems with the following paths:

	#CHARS  git_path
	---------------------------------
	20      BISECT_ANCESTORS_OK
	20      BISECT_EXPECTED_REV
	20      BISECT_FIRST_PARENT
	22      fsmonitor--daemon.ipc
	23      drop_redundant_commits
	23      git-rebase-todo.backup
	23      keep_redundant_commits
	23      reschedule-failed-exec
	24      allow_rerere_autoupdate
	26      no-reschedule-failed-exec

So I think we're good ;-)

> We may suffix "-%d" to make it unique after this truncation, so
> there definitely is a need for some slop, and 16 bytes should
> be sufficiently long.
>
>
> > @@ -5404,14 +5415,34 @@ static const char *label_oid(struct object_id *oid, const char *label,
> >  		 *
> >  		 * Note that we retain non-ASCII UTF-8 characters (identified
> >  		 * via the most significant bit). They should be all acceptable
> > -		 * in file names. We do not validate the UTF-8 here, that's not
> > -		 * the job of this function.
> > +		 * in file names.
> > +		 *
> > +		 * As we will use the labels as names of (loose) refs, it is
> > +		 * vital that the name not be longer than the maximum component
> > +		 * size of the file system (`NAME_MAX`). We are careful to
> > +		 * truncate the label accordingly, allowing for the `.lock`
> > +		 * suffix and for the label to be UTF-8 encoded (i.e. we avoid
> > +		 * truncating in the middle of a character).
> >  		 */
> > -		for (; *label; label++)
> > -			if ((*label & 0x80) || isalnum(*label))
> > +		for (; *label && buf->len + 1 < max_len; label++)
> > +			if (isalnum(*label) ||
> > +			    (!label_is_utf8 && (*label & 0x80)))
> >  				strbuf_addch(buf, *label);
> > +			else if (*label & 0x80) {
> > +				const char *p = label;
> > +
> > +				utf8_width(&p, NULL);
> > +				if (p) {
> > +					if (buf->len + (p - label) > max_len)
> > +						break;
> > +					strbuf_add(buf, label, p - label);
> > +					label = p - 1;
> > +				} else {
> > +					label_is_utf8 = 0;
> > +					strbuf_addch(buf, *label);
> > +				}
>
> utf8_width() does let you advance one Unicode character at a time as
> its side effect, but it may be a bit overkill, as its primary
> function is to compute the display width of that character.
>
> We could take advantage of the fact that the first byte of a UTF-8
> character has two high-bits set (i.e. 11xxxxxx) while the second and
> subsequent bytes have only the top-bit set and the second highest
> bit clear (i.e. 10xxxxxx) to simplify/optimize it.  If this were in
> a performance sensitive codepath, that is.

It is not a performance-critical code path, so I erred on the side of
simplicity (although I have to admit that the post image of the diff is
not exactly for the faint of heart).

Could we maybe form the plan to keep in the back of our heads that we
already have a UTF-8-truncating functionality in sequencer, and in case
another user should turn up, implement that optimized function in
`utf8.[ch]`?

> I'll queue it as-is for now, as we are in "regression fix only"
> phase of the cycle, and have enough time to polish it.

Thanks,
Johannes

Junio C Hamano Aug. 16, 2023, 4:28 p.m. UTC | #3
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> It is not a performance-critical code path, so I erred on the side of
> simplicity (although I have to admit that the post image of the diff is
> not exactly for the faint of heart).
>
> Could we maybe form the plan to keep in the back of our heads that we
> already have a UTF-8-truncating functionality in sequencer, and in case
> another user should turn up, implement that optimized function in
> `utf8.[ch]`?

Yup, that is a good idea.  Even though this one only cares about the
bytecount, we'd eventually benefit from two variants: truncate by
bytecount and truncate by display width.  Both variants should
return an error when given a bytestring that is not a valid UTF-8
sequence, and leave it to the caller to truncate at a byte boundary
as a fallback, which is trivial.  The alternative would be to do the
fallback truncation in the callee, but then the caller could not tell
whether the returned result is a fallback that the end user may need
to be warned about or a known-valid UTF-8 substring, so that would be
suboptimal.
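
A rough sketch of what the truncate-by-bytecount variant could look like, under stated assumptions: the names `utf8_seq_len()` and `utf8_truncate_bytes()` are hypothetical, nothing like this exists in `utf8.[ch]` as far as this thread says, and the validation is simplified (overlong forms and surrogates are not rejected).

```c
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the UTF-8 sequence starting at
 * s[0], or 0 if the lead byte or one of its continuation bytes is invalid.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
	size_t n, i;

	if (s[0] < 0x80)
		n = 1;
	else if ((s[0] & 0xe0) == 0xc0)
		n = 2;
	else if ((s[0] & 0xf0) == 0xe0)
		n = 3;
	else if ((s[0] & 0xf8) == 0xf0)
		n = 4;
	else
		return 0;	/* stray continuation byte or invalid lead */
	if (n > avail)
		return 0;	/* sequence is cut off by end of input */
	for (i = 1; i < n; i++)
		if ((s[i] & 0xc0) != 0x80)
			return 0;
	return n;
}

/*
 * Hypothetical helper: store in *out the longest prefix length that is
 * <= max_bytes and ends on a character boundary. Returns 0 on success,
 * or -1 if `s` is not valid UTF-8, leaving the fallback to the caller.
 */
static int utf8_truncate_bytes(const char *s, size_t len,
			       size_t max_bytes, size_t *out)
{
	const unsigned char *u = (const unsigned char *)s;
	size_t pos = 0, cut = 0, n;

	while (pos < len) {
		n = utf8_seq_len(u + pos, len - pos);
		if (!n)
			return -1;
		if (pos + n <= max_bytes)
			cut = pos + n;	/* this character still fits */
		pos += n;
	}
	*out = cut;
	return 0;
}
```

A caller that gets -1 back can fall back to cutting at `max_bytes` directly and, because the error is visible, decide whether to warn the user about it.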

Thanks.

Patch

diff --git a/git-compat-util.h b/git-compat-util.h
index d32aa754ae1..3e7a59b5ff1 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -422,6 +422,10 @@  char *gitdirname(char *);
 #define PATH_MAX 4096
 #endif
 
+#ifndef NAME_MAX
+#define NAME_MAX 255
+#endif
+
 typedef uintmax_t timestamp_t;
 #define PRItime PRIuMAX
 #define parse_timestamp strtoumax
diff --git a/sequencer.c b/sequencer.c
index adc9cfb4df3..be837bd2948 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -51,6 +51,15 @@ 
 
 #define GIT_REFLOG_ACTION "GIT_REFLOG_ACTION"
 
+/*
+ * To accommodate common filesystem limitations, where the loose refs' file
+ * names must not exceed `NAME_MAX`, the labels generated by `git rebase
+ * --rebase-merges` need to be truncated if the corresponding commit subjects
+ * are too long.
+ * Add some margin to stay clear from reaching `NAME_MAX`.
+ */
+#define GIT_MAX_LABEL_LENGTH ((NAME_MAX) - (LOCK_SUFFIX_LEN) - 16)
+
 static const char sign_off_header[] = "Signed-off-by: ";
 static const char cherry_picked_prefix[] = "(cherry picked from commit ";
 
@@ -5396,6 +5405,8 @@  static const char *label_oid(struct object_id *oid, const char *label,
 		}
 	} else {
 		struct strbuf *buf = &state->buf;
+		int label_is_utf8 = 1; /* start with this assumption */
+		size_t max_len = buf->len + GIT_MAX_LABEL_LENGTH;
 
 		/*
 		 * Sanitize labels by replacing non-alpha-numeric characters
@@ -5404,14 +5415,34 @@  static const char *label_oid(struct object_id *oid, const char *label,
 		 *
 		 * Note that we retain non-ASCII UTF-8 characters (identified
 		 * via the most significant bit). They should be all acceptable
-		 * in file names. We do not validate the UTF-8 here, that's not
-		 * the job of this function.
+		 * in file names.
+		 *
+		 * As we will use the labels as names of (loose) refs, it is
+		 * vital that the name not be longer than the maximum component
+		 * size of the file system (`NAME_MAX`). We are careful to
+		 * truncate the label accordingly, allowing for the `.lock`
+		 * suffix and for the label to be UTF-8 encoded (i.e. we avoid
+		 * truncating in the middle of a character).
 		 */
-		for (; *label; label++)
-			if ((*label & 0x80) || isalnum(*label))
+		for (; *label && buf->len + 1 < max_len; label++)
+			if (isalnum(*label) ||
+			    (!label_is_utf8 && (*label & 0x80)))
 				strbuf_addch(buf, *label);
+			else if (*label & 0x80) {
+				const char *p = label;
+
+				utf8_width(&p, NULL);
+				if (p) {
+					if (buf->len + (p - label) > max_len)
+						break;
+					strbuf_add(buf, label, p - label);
+					label = p - 1;
+				} else {
+					label_is_utf8 = 0;
+					strbuf_addch(buf, *label);
+				}
 			/* avoid leading dash and double-dashes */
-			else if (buf->len && buf->buf[buf->len - 1] != '-')
+			} else if (buf->len && buf->buf[buf->len - 1] != '-')
 				strbuf_addch(buf, '-');
 		if (!buf->len) {
 			strbuf_addstr(buf, "rev-");