Message ID | 20230901142624.12063-1-five231003@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | ref-filter: sort numerically when ":size" is used | expand |
Kousik Sanagavarapu <five231003@gmail.com> writes: > Atoms like "raw" and "contents" have a ":size" option which can be used > to know the size of the data. Since these atoms have the cmp_type > FIELD_STR, they are sorted alphabetically from 'a' to 'z' and '0' to > '9'. Meaning, even when the ":size" option is used and what we > ultimatlely have is numbers, we still sort alphabetically. There are other cmp_types already defined, like ULONG and TIME. How are they used and affect the comparison? Naively, :size sounds like a good candidate to compare as ULONG, as it cannot be negative even though 0 is a valid size. I understand and agree with the motivation, but the implementation looks puzzling. > diff --git a/ref-filter.c b/ref-filter.c > index 1bfaf20fbf..5d7bea5f23 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -932,7 +932,13 @@ struct atom_value { > ssize_t s_size; > int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state, > struct strbuf *err); > - uintmax_t value; /* used for sorting when not FIELD_STR */ > + > + /* > + * Used for sorting when not FIELD_STR or when FIELD_STR but the > + * sort should be numeric and not alphabetic. > + */ > + uintmax_t value; This does not explain why we cannot make <anything>:size FIELD_ULONG for <anything> that is of FIELD_STR type, though. IOW, why such a strange "when FIELD_STR but the sort should be numeric" is needed? If you have a <size>, shouldn't it always be numeric? IOW, when you notice that you need to set, say, u.contents.option of an atom to C_LENGTH, shouldn't you set cmp_type of the atom to FIELD_ULONG, somewhere in contents_atom_parser() and friends, and everything should naturally follow, no? It seems that support for other cmp_types are incomplete in the current code. There are FIELD_ULONG and FIELD_TIME defined, but they do not appear to be used in any way, so the cmp_ref_sorting() would need to be updated to make it actually pay attention to the cmp_type and perform numeric comparison. > @@ -1883,8 +1890,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp > v->s = strbuf_detach(&sb, NULL); > } else if (atom->u.contents.option == C_BODY_DEP) > v->s = xmemdupz(bodypos, bodylen); > - else if (atom->u.contents.option == C_LENGTH) > - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos)); > + else if (atom->u.contents.option == C_LENGTH) { > + v->value = (uintmax_t)strlen(subpos); > + v->s = xstrfmt("%"PRIuMAX, v->value); > + } We should take a note that all of these v->value are *per* *item* that will be sorted. > @@ -2986,7 +2996,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru > cmp_detached_head = 1; > } else if (s->sort_flags & REF_SORTING_VERSION) { > cmp = versioncmp(va->s, vb->s); > - } else if (cmp_type == FIELD_STR) { > + } else if (cmp_type == FIELD_STR && !va->value && !vb->value) { Two refs may point at an empty object with zero length, i.e. for them, !v->value is true, and another ref may point at a non-empty object. The two empty refs are compared with an algorithm different from the algorithm used to compare the empty ref and the non-empty ref. Isn't this broken as a comparison function to be given to QSORT(), which must be transitive (e.g. if A < B and B < C, then it should be guaranteed that A < C and you do not have to compare A and C)? IOW, the choice of the comparison algorithm should not depend on an attribute (like value or s) that is specific to the item being compared. Things like cmp_type that is defined at the used_atom lefvel to make the sorting function stable, I would think.
On Fri, Sep 01, 2023 at 09:43:07AM -0700, Junio C Hamano wrote: > IOW, when you notice that you need to set, say, u.contents.option of > an atom to C_LENGTH, shouldn't you set cmp_type of the atom to > FIELD_ULONG, somewhere in contents_atom_parser() and friends, and > everything should naturally follow, no? Yeah, I had the same thought after reading the patch. Unfortunately the "type" is used only for comparison, not formatting. So you are still stuck setting both v->value and v->s in grab_sub_body_contents(). It feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as a preparatory refactoring. But it's not that big a deal to work around it if that turns out to be hard. > It seems that support for other cmp_types are incomplete in the > current code. There are FIELD_ULONG and FIELD_TIME defined, but > they do not appear to be used in any way, so the cmp_ref_sorting() > would need to be updated to make it actually pay attention to the > cmp_type and perform numeric comparison. I think they are covered implicitly by the "else" block of the conditional that checks for FIELD_STR. So just this is sufficient to make contents:size work: diff --git a/ref-filter.c b/ref-filter.c index 88b021dd1d..02e3b6ba82 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -583,9 +583,10 @@ static int contents_atom_parser(struct ref_format *format, struct used_atom *ato atom->u.contents.option = C_BARE; else if (!strcmp(arg, "body")) atom->u.contents.option = C_BODY; - else if (!strcmp(arg, "size")) + else if (!strcmp(arg, "size")) { + atom->type = FIELD_ULONG; atom->u.contents.option = C_LENGTH; - else if (!strcmp(arg, "signature")) + } else if (!strcmp(arg, "signature")) atom->u.contents.option = C_SIG; else if (!strcmp(arg, "subject")) atom->u.contents.option = C_SUB; @@ -1885,9 +1886,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp v->s = strbuf_detach(&sb, NULL); } else if (atom->u.contents.option == C_BODY_DEP) v->s = xmemdupz(bodypos, bodylen); - else if (atom->u.contents.option == C_LENGTH) - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos)); - else if (atom->u.contents.option == C_BODY) + else if (atom->u.contents.option == C_LENGTH) { + v->value = strlen(subpos); + v->s = xstrfmt("%"PRIuMAX, v->value); + } else if (atom->u.contents.option == C_BODY) v->s = xmemdupz(bodypos, nonsiglen); else if (atom->u.contents.option == C_SIG) v->s = xmemdupz(sigpos, siglen); -Peff
Jeff King <peff@peff.net> writes: > On Fri, Sep 01, 2023 at 09:43:07AM -0700, Junio C Hamano wrote: > >> IOW, when you notice that you need to set, say, u.contents.option of >> an atom to C_LENGTH, shouldn't you set cmp_type of the atom to >> FIELD_ULONG, somewhere in contents_atom_parser() and friends, and >> everything should naturally follow, no? > > Yeah, I had the same thought after reading the patch. Unfortunately the > "type" is used only for comparison, not formatting. So you are still > stuck setting both v->value and v->s in grab_sub_body_contents(). It > feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as > a preparatory refactoring. But it's not that big a deal to work around > it if that turns out to be hard. Setting of the .value member happens O(N) times for the number of refs involved, which does not bother me. Do you mean "when we know we are not sorting with size we should omit parsing the string into the .value member"? If so, I think that would be nice to have. >> It seems that support for other cmp_types are incomplete in the >> current code. There are FIELD_ULONG and FIELD_TIME defined, but >> they do not appear to be used in any way, so the cmp_ref_sorting() >> would need to be updated to make it actually pay attention to the >> cmp_type and perform numeric comparison. > > I think they are covered implicitly by the "else" block of the > conditional that checks for FIELD_STR. Ah, OK. That needs to be future-proofed to force future developers who want to add different FIELD_FOO type to look at the comparison logic. If we want to do so, it should be done as a separate topic for cleaning-up the mess, not as part of this effort. > So just this is sufficient to make contents:size work: > > diff --git a/ref-filter.c b/ref-filter.c > index 88b021dd1d..02e3b6ba82 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -583,9 +583,10 @@ static int contents_atom_parser(struct ref_format *format, struct used_atom *ato > atom->u.contents.option = C_BARE; > else if (!strcmp(arg, "body")) > atom->u.contents.option = C_BODY; > - else if (!strcmp(arg, "size")) > + else if (!strcmp(arg, "size")) { > + atom->type = FIELD_ULONG; > atom->u.contents.option = C_LENGTH; > - else if (!strcmp(arg, "signature")) > + } else if (!strcmp(arg, "signature")) > atom->u.contents.option = C_SIG; > else if (!strcmp(arg, "subject")) > atom->u.contents.option = C_SUB; > @@ -1885,9 +1886,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp > v->s = strbuf_detach(&sb, NULL); > } else if (atom->u.contents.option == C_BODY_DEP) > v->s = xmemdupz(bodypos, bodylen); > - else if (atom->u.contents.option == C_LENGTH) > - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos)); > - else if (atom->u.contents.option == C_BODY) > + else if (atom->u.contents.option == C_LENGTH) { > + v->value = strlen(subpos); > + v->s = xstrfmt("%"PRIuMAX, v->value); > + } else if (atom->u.contents.option == C_BODY) > v->s = xmemdupz(bodypos, nonsiglen); > else if (atom->u.contents.option == C_SIG) > v->s = xmemdupz(sigpos, siglen); Yup, exactly. Thanks.
On Fri, Sep 01, 2023 at 10:59:28AM -0700, Junio C Hamano wrote: > > Yeah, I had the same thought after reading the patch. Unfortunately the > > "type" is used only for comparison, not formatting. So you are still > > stuck setting both v->value and v->s in grab_sub_body_contents(). It > > feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as > > a preparatory refactoring. But it's not that big a deal to work around > > it if that turns out to be hard. > > Setting of the .value member happens O(N) times for the number of > refs involved, which does not bother me. Do you mean "when we know > we are not sorting with size we should omit parsing the string into > the .value member"? If so, I think that would be nice to have. No, I wasn't worried about code efficiency, but rather programmer effort. IOW, I expected that the second hunk that I showed could look like this: diff --git a/ref-filter.c b/ref-filter.c index 88b021dd1d..02b02d6813 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1886,7 +1886,7 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp } else if (atom->u.contents.option == C_BODY_DEP) v->s = xmemdupz(bodypos, bodylen); else if (atom->u.contents.option == C_LENGTH) - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos)); + v->value = strlen(subpos); else if (atom->u.contents.option == C_BODY) v->s = xmemdupz(bodypos, nonsiglen); else if (atom->u.contents.option == C_SIG) rather than setting both "value" and "s", and that some higher level code would recognize "oh, this is FIELD_ULONG, so I'll format it rather than looking at v->s". But it seems that such code does not exist. :) All of the other spots that set v->value (e.g., objectsize), just set both. The ref-filter code is a weird mix of almost-object-oriented bits, giant switch statements, and assumptions about which fields are set when. > > I think they are covered implicitly by the "else" block of the > > conditional that checks for FIELD_STR. > > Ah, OK. That needs to be future-proofed to force future developers > who want to add different FIELD_FOO type to look at the comparison > logic. If we want to do so, it should be done as a separate topic > for cleaning-up the mess, not as part of this effort. Yes, agreed. -Peff
On Fri, Sep 01, 2023 at 02:32:06PM -0400, Jeff King wrote: > On Fri, Sep 01, 2023 at 10:59:28AM -0700, Junio C Hamano wrote: > > > > Yeah, I had the same thought after reading the patch. Unfortunately the > > > "type" is used only for comparison, not formatting. So you are still > > > stuck setting both v->value and v->s in grab_sub_body_contents(). It > > > feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as > > > a preparatory refactoring. But it's not that big a deal to work around > > > it if that turns out to be hard. > > > > Setting of the .value member happens O(N) times for the number of > > refs involved, which does not bother me. Do you mean "when we know > > we are not sorting with size we should omit parsing the string into > > the .value member"? If so, I think that would be nice to have. > > No, I wasn't worried about code efficiency, but rather programmer > effort. IOW, I expected that the second hunk that I showed could look > like this: > > diff --git a/ref-filter.c b/ref-filter.c > index 88b021dd1d..02b02d6813 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -1886,7 +1886,7 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp > } else if (atom->u.contents.option == C_BODY_DEP) > v->s = xmemdupz(bodypos, bodylen); > else if (atom->u.contents.option == C_LENGTH) > - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos)); > + v->value = strlen(subpos); > else if (atom->u.contents.option == C_BODY) > v->s = xmemdupz(bodypos, nonsiglen); > else if (atom->u.contents.option == C_SIG) This looks very tempting, although too good to be true with the current ref-filter I guess, as you explain below. > rather than setting both "value" and "s", and that some higher level > code would recognize "oh, this is FIELD_ULONG, so I'll format it rather > than looking at v->s". But it seems that such code does not exist. :) > All of the other spots that set v->value (e.g., objectsize), just set > both. This was also one of the reasons why I decided to set both v->value and v->s, that is because "objectsize" was implemented in a similar fashion. Although I left "cmp_type" field untouched for the reasons below. > > > I think they are covered implicitly by the "else" block of the > > > conditional that checks for FIELD_STR. > > > > Ah, OK. That needs to be future-proofed to force future developers > > who want to add different FIELD_FOO type to look at the comparison > > logic. If we want to do so, it should be done as a separate topic > > for cleaning-up the mess, not as part of this effort. What I also find weird is the fact that we assign a "cmp_type" to the whole atom. Like "contents" is FIELD_STR and "objectsize" is "FIELD_ULONG" in "valid_atom". This seems wrong because the options of the atoms should be the ones deciding the "cmp_type", no? I wanted to leave the "cmp_type" field of the atom untouched because that would mess up this "global" setting of "contents" to be a "FIELD_STR" (or even "raw" for that matter). Although that seems like a bad idea, after I've read Junio's and your comments. Thanks > > Yes, agreed. > > -Peff
On Sat, Sep 02, 2023 at 12:29:07AM +0530, Kousik Sanagavarapu wrote: > > > > I think they are covered implicitly by the "else" block of the > > > > conditional that checks for FIELD_STR. > > > > > > Ah, OK. That needs to be future-proofed to force future developers > > > who want to add different FIELD_FOO type to look at the comparison > > > logic. If we want to do so, it should be done as a separate topic > > > for cleaning-up the mess, not as part of this effort. > > What I also find weird is the fact that we assign a "cmp_type" to the > whole atom. Like "contents" is FIELD_STR and "objectsize" is "FIELD_ULONG" > in "valid_atom". This seems wrong because the options of the atoms should be > the ones deciding the "cmp_type", no? I think the data structure is a little confusing if you haven't worked with it before. But basically each "atom" corresponds to a single "%()" block in the format. So if you ran: git for-each-ref --format="%(contents:size) %(contents:body)" you'd have two atoms in the used_atom struct: one for the size and one for the body. IMHO the code would be a lot easier to work with if the atoms were structured as a parse tree with child pointers (especially when you get into things like "if" that have sub-expressions). I think one of the reasons that used_atom is an array is to de-duplicate repeated mentions (so if you formatted "%(foo) %(foo)" it would only have to store the computed value once). But I think that is the wrong way to optimize it. We shouldn't be storing any strings per-atom, but rather walking the parse tree to produce a single output buffer. And the values should be cheap to fill in, because we should parse the object as necessary up front. This is more or less the way the pretty.c parser does it. But that is all quite a large tangent from what you're working on, and would probably be a ground-up rewrite of the formatting code. You can safely ignore my rant for the purposes of your patch. ;) > I wanted to leave the "cmp_type" field of the atom untouched because that > would mess up this "global" setting of "contents" to be a "FIELD_STR" (or > even "raw" for that matter). Although that seems like a bad idea, after > I've read Junio's and your comments. Yeah, I agree that would be a problem if there were one global "contents". But we are allocating a new atom struct on the fly for the contents:size directive that we parse, and so on. -Peff
Kousik Sanagavarapu <five231003@gmail.com> writes: > What I also find weird is the fact that we assign a "cmp_type" to the > whole atom. Like "contents" is FIELD_STR and "objectsize" is "FIELD_ULONG" > in "valid_atom". This seems wrong because the options of the atoms should be > the ones deciding the "cmp_type", no? I do not quite get where your confusion comes from. The use of valid_atom[] purely for catalogging things like "contents", "refname", etc., before specialization, as opposed to used_atom[] that lists the actual specialized form of the atoms used in the format string. If you refer to "contents:body" and "contents:size" in your format string, they become two entries in used_atom[], both of which refer to the same atom_type obtained by consulting the same entry in the valid_atom[] array. The specialization between "contents:body" and "contents:size" must be captured somewhere, and that happens by using two used_atom[] entries. There will be one "struct atom_value" for each of these placeholders, each of which refers to its own used_atom that knows for which variant of "contents" it was created. Of course, these two "struct atom_value" instances will have different content string for the same ref (one stores the body part of the string, the other stores the size of the contents). > I wanted to leave the "cmp_type" field of the atom untouched because that > would mess up this "global" setting of "contents" to be a "FIELD_STR" (or > even "raw" for that matter). We are not talking about futzing with valid_atom[] array. Because the used_atom[] array is designed to be used to capture the differences among "contents" vs "contents:body" vs "contents:size", what types of entities the values that uses an entry in used_atom[] array (i.e. an instance of "struct atom_value") should be decided using the information stored there. I agree that Peff's "the value for 'contents:size' we know is numeric, so only store the numeric value in atom_value and let the output logic handle that using cmp_type information" sound very tempting. If we were to tackle it, however, I think it should be a separate topic. In any case, it was very good that you noticed we do not sort numerically when sorting by size (I guess our sort by timestamp weren't affected only because we have been lucky?). Thanks for starting this topic.
Jeff King <peff@peff.net> writes: > But I think that is the wrong way to optimize it. We shouldn't be > storing any strings per-atom, but rather walking the parse tree to > produce a single output buffer. And the values should be cheap to fill > in, because we should parse the object as necessary up front. This is > more or less the way the pretty.c parser does it. I thought "as necessary" may be a bit tricky as populate_value() were taught to omit doing the whole get_object() thing when the values for used_atom[] are all computable without parsing the object at all, but it seems that over time the populate_value() callchain has degraded sufficiently to unconditionally call get_object() these days, so I agree that the arrangement does not have much optimization value, at least in the current code.
On Fri, Sep 01, 2023 at 01:04:08PM -0700, Junio C Hamano wrote: > In any case, it was very good that you noticed we do not sort > numerically when sorting by size (I guess our sort by timestamp > weren't affected only because we have been lucky?). Thanks for > starting this topic. I think the date code works as expected, and by design. It's just that the code is a little confusing. :) Something like %(committerdate) gets a used_atom with FIELD_TIME, and the "grab" function sets v->value to the unix-epoch timestamp. And then the comparison function uses v->value, since it's not FIELD_STR. It's a little hard to follow the code that fills in v->value, but the call stack is roughly grab_values() -> grab_person() -> grab_date(). -Peff
On Fri, Sep 01, 2023 at 01:21:10PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > But I think that is the wrong way to optimize it. We shouldn't be > > storing any strings per-atom, but rather walking the parse tree to > > produce a single output buffer. And the values should be cheap to fill > > in, because we should parse the object as necessary up front. This is > > more or less the way the pretty.c parser does it. > > I thought "as necessary" may be a bit tricky as populate_value() > were taught to omit doing the whole get_object() thing when the > values for used_atom[] are all computable without parsing the object > at all, but it seems that over time the populate_value() callchain > has degraded sufficiently to unconditionally call get_object() these > days, so I agree that the arrangement does not have much optimization > value, at least in the current code. No, I think we still do that optimization. When parsing the format string, the parser function for each atom sets fields in an object_info struct to indicate what we're interested in. Then for each ref, we call populate_value(). If that object_info doesn't need anything (we byte-wise compare it to an empty dummy struct), then we return early, before calling get_object(). And that optimization is very important to retain; it makes a format like %(refname) an order of magnitude faster. The optimization I was referring to is that if you have a format like: %(contents:body) %(contents:body) then we'll de-duplicate that to a single used_atom struct, and they'll share the same v->s result string. That's much harder to do if you parse into an abstract syntax tree, since the two occupy different parts of the tree. But my contention is that it does not matter if you stop allocating v->s in the first place, and just walk the tree to directly output the result (either to a strbuf or directly to stdout). -Peff
diff --git a/ref-filter.c b/ref-filter.c index 1bfaf20fbf..5d7bea5f23 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -932,7 +932,13 @@ struct atom_value { ssize_t s_size; int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state, struct strbuf *err); - uintmax_t value; /* used for sorting when not FIELD_STR */ + + /* + * Used for sorting when not FIELD_STR or when FIELD_STR but the + * sort should be numeric and not alphabetic. + */ + uintmax_t value; + struct used_atom *atom; }; @@ -1857,7 +1863,8 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp v->s = xmemdupz(buf, buf_size); v->s_size = buf_size; } else if (atom->u.raw_data.option == RAW_LENGTH) { - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size); + v->value = (uintmax_t)buf_size; + v->s = xstrfmt("%"PRIuMAX, v->value); } continue; } @@ -1883,8 +1890,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp v->s = strbuf_detach(&sb, NULL); } else if (atom->u.contents.option == C_BODY_DEP) v->s = xmemdupz(bodypos, bodylen); - else if (atom->u.contents.option == C_LENGTH) - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos)); + else if (atom->u.contents.option == C_LENGTH) { + v->value = (uintmax_t)strlen(subpos); + v->s = xstrfmt("%"PRIuMAX, v->value); + } else if (atom->u.contents.option == C_BODY) v->s = xmemdupz(bodypos, nonsiglen); else if (atom->u.contents.option == C_SIG) @@ -2265,6 +2274,7 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err) v->s_size = ATOM_SIZE_UNSPECIFIED; v->handler = append_atom; + v->value = 0; v->atom = atom; if (*name == '*') { @@ -2986,7 +2996,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru cmp_detached_head = 1; } else if (s->sort_flags & REF_SORTING_VERSION) { cmp = versioncmp(va->s, vb->s); - } else if (cmp_type == FIELD_STR) { + } else if (cmp_type == FIELD_STR && !va->value && !vb->value) { if (va->s_size < 0 && vb->s_size < 0) { int (*cmp_fn)(const char *, const char *); cmp_fn = s->sort_flags & REF_SORTING_ICASE diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh index aa3c7c03c4..7b943fd34c 100755 --- a/t/t6300-for-each-ref.sh +++ b/t/t6300-for-each-ref.sh @@ -1017,16 +1017,16 @@ test_expect_success 'Verify sorts with raw' ' test_expect_success 'Verify sorts with raw:size' ' cat >expected <<-EOF && refs/myblobs/blob8 - refs/myblobs/first refs/myblobs/blob7 - refs/heads/main refs/myblobs/blob4 refs/myblobs/blob1 refs/myblobs/blob2 refs/myblobs/blob3 refs/myblobs/blob5 refs/myblobs/blob6 + refs/myblobs/first refs/mytrees/first + refs/heads/main EOF git for-each-ref --format="%(refname)" --sort=raw:size \ refs/heads/main refs/myblobs/ refs/mytrees/first >actual && @@ -1138,6 +1138,17 @@ test_expect_success 'for-each-ref --format compare with cat-file --batch' ' test_cmp expected actual ' +test_expect_success 'verify sorts with contents:size' ' + cat >expect <<-\EOF && + refs/heads/main + refs/heads/newtag + refs/heads/ambiguous + EOF + git for-each-ref --format="%(refname)" \ + --sort=contents:size refs/heads/ >actual && + test_cmp expect actual +' + test_expect_success 'set up multiple-sort tags' ' for when in 100000 200000 do
Atoms like "raw" and "contents" have a ":size" option which can be used to know the size of the data. Since these atoms have the cmp_type FIELD_STR, they are sorted alphabetically from 'a' to 'z' and '0' to '9'. Meaning, even when the ":size" option is used and what we ultimatlely have is numbers, we still sort alphabetically. For example, consider the the following case in a repo refname contents:size raw:size ======= ============= ======== refs/heads/branch1 1130 1210 refs/heads/master 300 410 refs/tags/v1.0 140 260 Sorting with "--format="%(refname) %(contents:size) --sort=contents:size" would give refs/heads/branch1 1130 refs/tags/v1.0.0 140 refs/heads/master 300 which is an alphabetic sort, while what one might really expect is refs/tags/v1.0.0 140 refs/heads/master 300 refs/heads/branch1 1130 which is a numeric sort (that is, a "$ sort -n file" as opposed to a "$ sort file", where "file" contains only the "contents:size" or "raw:size" info, each of which is on a newline). Same is the case with "--sort=raw:size". So, sort numerically whenever the sort is done with "contents:size" or "raw:size" and do it the normal alphabetic way when "contents" or "raw" are used with some other option (they are FIELD_STR anyways). Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com> --- ref-filter.c | 20 +++++++++++++++----- t/t6300-for-each-ref.sh | 15 +++++++++++++-- 2 files changed, 28 insertions(+), 7 deletions(-)