[12/14] kconfig: sort found symbols by relevance
diff mbox

Message ID 3e2346029e33ef9405b87d72a32d96d0aafb524a.1371595499.git.yann.morin.1998@free.fr
State New, archived
Headers show

Commit Message

Yann E. MORIN June 18, 2013, 10:45 p.m. UTC
From: "Yann E. MORIN" <yann.morin.1998@free.fr>

When searching for symbols, return the symbols sorted by relevance.

Sorting is done as thus:
  - first, symbols with a prompt,   [1]
  - then, smallest offset,          [2]
  - then, shortest match,           [3]
  - then, highest relative match,   [4]
  - finally, alphabetical sort      [5]

So, searching (eg.) for 'P.*CI' :

[1] Symbols of interest are probably those with a prompt, as they can be
    changed, while symbols with no prompt are only for info. Thus:
        PCIEASPM comes before PCI_ATS

[2] Symbols that match earlier in the name are to be preferred over
    symbols which match later. Thus:
        PCI_MSI comes before WDTPCI

[3] The shortest match is (IMHO) more interesting than a longer one.
    Thus:
        PCI comes before PCMCIA

[4] The relative match is the ratio of the length of the match against
    the length of the symbol. The more of a symbol name we match, the
    more instersting that symbol is. Thus:
        PCIEAER comes before PCIEASPM

[5] As fallback, sort symbols alphabetically

This heuristic tries hard to get interesting symbols first in the list.

In any case, exact match can (as previously) be requested by using
start-of-line and end-of-line in the search regexp: ^PCI$

Reported-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: "Yann E. MORIN" <yann.morin.1998@free.fr>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Roland Eggner <edvx1@systemanalysen.net>
Cc: Wang YanQing <udknight@gmail.com>
---
 scripts/kconfig/symbol.c | 96 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 9 deletions(-)

Comments

Jean Delvare June 24, 2013, 7:57 a.m. UTC | #1
Hi Yann,

Sorry for the late reply...

Le Wednesday 19 June 2013 à 00:45 +0200, Yann E. MORIN a écrit :
> From: "Yann E. MORIN" <yann.morin.1998@free.fr>
> 
> When searching for symbols, return the symbols sorted by relevance.
> 
> Sorting is done as thus:
>   - first, symbols with a prompt,   [1]
>   - then, smallest offset,          [2]
>   - then, shortest match,           [3]
>   - then, highest relative match,   [4]
>   - finally, alphabetical sort      [5]
> 
> So, searching (eg.) for 'P.*CI' :

Nobody would actually search for that, so that's not a particularly good
example to determine whether your sort order is sane or not.

> 
> [1] Symbols of interest are probably those with a prompt, as they can be
>     changed, while symbols with no prompt are only for info. Thus:
>         PCIEASPM comes before PCI_ATS

This is not necessarily true. I often look for symbols which have no
prompt, and the information I am looking for is exactly "does this
symbol have a prompt or is its value determined automatically"?

> [2] Symbols that match earlier in the name are to be preferred over
>     symbols which match later. Thus:
>         PCI_MSI comes before WDTPCI

This makes some sense, although it could have some unexpected side
effects (e.g. FOO_BAR_PCI would be listed before SOMETHING_PCI_BAZBAZ,
right?)

> [3] The shortest match is (IMHO) more interesting than a longer one.
>     Thus:
>         PCI comes before PCMCIA

This makes sense too, but I'm sure there are cases where it will be
confusing too, and alphabetical order would do it in part too.

> [4] The relative match is the ratio of the length of the match against
>     the length of the symbol. The more of a symbol name we match, the
>     more instersting that symbol is. Thus:
>         PCIEAER comes before PCIEASPM

This is an obscure sort rule and I'm sure it will add more confusion
than it will help. Alphabetical order should really be good enough at
this point.

> 
> [5] As fallback, sort symbols alphabetically
> 
> This heuristic tries hard to get interesting symbols first in the list.

I know I am the one whose question triggered this work from you, but in
the end the "response" seems disproportionate. Having 5 different
ordering rules is a lot, and that's quite a bit of code, which while not
the most complex in the world, is still far from trivial and may require
maintenance work in the future.

The result of the search will be read by a human. As such, the order of
the results should be immediately obvious. Returning what the user is
most probably looking for first in the list is not as important as
making it possible for the user to search the results for what he/she is
looking for, and this can only be the case if the user understands the
sort order instinctively. The 5 rules you used aren't exactly that IMHO.
This situation is completely different from a web search for example,
where the search engine has to select the most relevant pages because
listing them all is out of the question.

The original order was random and that's the worse possible order, so I
am all for improving it. But I think the sort order should be driven by
simple and instinctive rules. This means 1 or 2 ordering rules, not 5.
As I recall, my original proposal was "exact match first, then
alphabetical order." 2 rules, instinctive and easy to document.

> In any case, exact match can (as previously) be requested by using
> start-of-line and end-of-line in the search regexp: ^PCI$

Actually this solved my original problem just fine. As a matter of fact,
I stopped replying to the original thread shortly after learning this,
because I kind of lost interest in the following discussion.

So I think it is more important to make it clearer that regular
expressions are allowed, than to come up with a brilliant sort order. I
know the help page says it, but the prompt itself only asks for a
"(sub)string" and it is not immediately obvious (to me at least) that
regular expressions are considered substrings.

BTW, if you really decide to commit your proposal despite my concerns,
you will have to document the sorting rules in the help page.

Thanks,
Michal Marek June 24, 2013, 8:42 a.m. UTC | #2
On 24.6.2013 09:57, Jean Delvare wrote:
> Hi Yann,
> 
> Sorry for the late reply...
> 
> Le Wednesday 19 June 2013 à 00:45 +0200, Yann E. MORIN a écrit :
>> From: "Yann E. MORIN" <yann.morin.1998@free.fr>
>>
>> When searching for symbols, return the symbols sorted by relevance.
>>
>> Sorting is done as thus:
>>   - first, symbols with a prompt,   [1]
>>   - then, smallest offset,          [2]
>>   - then, shortest match,           [3]
>>   - then, highest relative match,   [4]
>>   - finally, alphabetical sort      [5]
>>
>> So, searching (eg.) for 'P.*CI' :
> 
> Nobody would actually search for that, so that's not a particularly good
> example to determine whether your sort order is sane or not.
> 
>>
>> [1] Symbols of interest are probably those with a prompt, as they can be
>>     changed, while symbols with no prompt are only for info. Thus:
>>         PCIEASPM comes before PCI_ATS
> 
> This is not necessarily true. I often look for symbols which have no
> prompt, and the information I am looking for is exactly "does this
> symbol have a prompt or is its value determined automatically"?
> 
>> [2] Symbols that match earlier in the name are to be preferred over
>>     symbols which match later. Thus:
>>         PCI_MSI comes before WDTPCI
> 
> This makes some sense, although it could have some unexpected side
> effects (e.g. FOO_BAR_PCI would be listed before SOMETHING_PCI_BAZBAZ,
> right?)
> 
>> [3] The shortest match is (IMHO) more interesting than a longer one.
>>     Thus:
>>         PCI comes before PCMCIA
> 
> This makes sense too, but I'm sure there are cases where it will be
> confusing too, and alphabetical order would do it in part too.

PCI does sort before PCMCIA alphabetically ;). Also, the match lenght
will only be different when using regular expressions, so this rule will
not matter in the usual case.


>> [4] The relative match is the ratio of the length of the match against
>>     the length of the symbol. The more of a symbol name we match, the
>>     more instersting that symbol is. Thus:
>>         PCIEAER comes before PCIEASPM
> 
> This is an obscure sort rule and I'm sure it will add more confusion
> than it will help. Alphabetical order should really be good enough at
> this point.

When the prefix matches, then I agree that most people will expect the
matches be alphabetically sorted. Maybe only do this comparison only if
s1->so != 0, otherwise fallback to the strcmp() directly?

Thanks,
Michal
--
To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann E. MORIN June 24, 2013, 5:13 p.m. UTC | #3
Jean, Michal, All,

On 2013-06-24 09:57 +0200, Jean Delvare spake thusly:
> Le Wednesday 19 June 2013 à 00:45 +0200, Yann E. MORIN a écrit :
> > When searching for symbols, return the symbols sorted by relevance.
> > 
> > Sorting is done as thus:
> >   - first, symbols with a prompt,   [1]
> >   - then, smallest offset,          [2]
> >   - then, shortest match,           [3]
> >   - then, highest relative match,   [4]
> >   - finally, alphabetical sort      [5]
> > 
> > So, searching (eg.) for 'P.*CI' :
> 
> Nobody would actually search for that, so that's not a particularly good
> example to determine whether your sort order is sane or not.

Yes, this was just to explain the sorting algorithm with a simple
example.

[--SNIP--]
> > This heuristic tries hard to get interesting symbols first in the list.
> 
> I know I am the one whose question triggered this work from you, but in
> the end the "response" seems disproportionate. Having 5 different
> ordering rules is a lot, and that's quite a bit of code, which while not
> the most complex in the world, is still far from trivial and may require
> maintenance work in the future.

OK, I understand you concerns.

Michal, please do not apply this patch: I'll rewrite it with a simpler
heuristic as Jean suggested, and will re-submit a complete pull-request
later tonight.

> So I think it is more important to make it clearer that regular
> expressions are allowed, than to come up with a brilliant sort order. I
> know the help page says it, but the prompt itself only asks for a
> "(sub)string" and it is not immediately obvious (to me at least) that
> regular expressions are considered substrings.

OK, Ill see to rewite the title of the dialog box.

Thanks for the review!

Regards,
Yann E. MORIN.

Patch
diff mbox

diff --git a/scripts/kconfig/symbol.c b/scripts/kconfig/symbol.c
index ab8f4c8..d08e300 100644
--- a/scripts/kconfig/symbol.c
+++ b/scripts/kconfig/symbol.c
@@ -954,38 +954,116 @@  const char *sym_escape_string_value(const char *in)
 	return res;
 }
 
+struct sym_match {
+	struct symbol	*sym;
+	off_t		so, eo;
+};
+
+/* Compare matched symbols as thus:
+ * - first, symbols with a prompt,
+ * - then, smallest offset,
+ * - then, shortest match,
+ * - then, highest relative match,
+ * - finally, alphabetical sort
+ */
+static int sym_rel_comp( const void *sym1, const void *sym2 )
+{
+	struct sym_match *s1 = *(struct sym_match **)sym1;
+	struct sym_match *s2 = *(struct sym_match **)sym2;
+	struct property *p;
+	bool p1 = false, p2 = false;
+	int l1, l2, r1, r2;
+
+	for_all_prompts(s1->sym, p) {
+		p1 = true;
+	}
+	for_all_prompts(s2->sym, p) {
+		p2 = true;
+	}
+	if (p1 && !p2)
+		return -1;
+	if (!p1 && p2)
+		return 1;
+
+	if (s1->so < s2->so)
+		return -1;
+	if (s1->so > s2->so)
+		return 1;
+
+	l1 = s1->eo - s1->so;
+	l2 = s2->eo - s2->so;
+	if (l1>l2)
+		return 1;
+	if (l1<l2)
+		return -1;
+
+	r1 = 100*l1 / strlen(s1->sym->name);
+	r2 = 100*l2 / strlen(s2->sym->name);
+	if (r1 > r2)
+		return -1;
+	if (r1 < r2)
+		return 1;
+
+	return strcmp(s1->sym->name, s2->sym->name);
+}
+
 struct symbol **sym_re_search(const char *pattern)
 {
 	struct symbol *sym, **sym_arr = NULL;
+	struct sym_match **sym_match_arr = NULL;
 	int i, cnt, size;
 	regex_t re;
+	regmatch_t match[1];
 
 	cnt = size = 0;
 	/* Skip if empty */
 	if (strlen(pattern) == 0)
 		return NULL;
-	if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB|REG_ICASE))
+	if (regcomp(&re, pattern, REG_EXTENDED|REG_ICASE))
 		return NULL;
 
 	for_all_symbols(i, sym) {
+		struct sym_match *tmp_sym_match;
 		if (sym->flags & SYMBOL_CONST || !sym->name)
 			continue;
-		if (regexec(&re, sym->name, 0, NULL, 0))
+		if (regexec(&re, sym->name, 1, match, 0))
 			continue;
 		if (cnt + 1 >= size) {
-			void *tmp = sym_arr;
+			void *tmp;
 			size += 16;
-			sym_arr = realloc(sym_arr, size * sizeof(struct symbol *));
-			if (!sym_arr) {
-				free(tmp);
-				return NULL;
+			tmp = realloc(sym_match_arr, size * sizeof(struct sym_match *));
+			if (!tmp) {
+				goto sym_re_search_free;
 			}
+			sym_match_arr = tmp;
 		}
 		sym_calc_value(sym);
-		sym_arr[cnt++] = sym;
+		tmp_sym_match = (struct sym_match*)malloc(sizeof(struct sym_match));
+		if (!tmp_sym_match)
+			goto sym_re_search_free;
+		tmp_sym_match->sym = sym;
+		/* As regexec return 0, we know we have a match, so
+		 * we can use match[0].rm_[se]o without further checks
+		 */
+		tmp_sym_match->so = match[0].rm_so;
+		tmp_sym_match->eo = match[0].rm_eo;
+		sym_match_arr[cnt++] = tmp_sym_match;
 	}
-	if (sym_arr)
+	if (sym_match_arr) {
+		qsort(sym_match_arr, cnt, sizeof(struct sym_match*), sym_rel_comp);
+		sym_arr = malloc((cnt+1) * sizeof(struct symbol));
+		if (!sym_arr)
+			goto sym_re_search_free;
+		for (i = 0; i < cnt; i++)
+			sym_arr[i] = sym_match_arr[i]->sym;
 		sym_arr[cnt] = NULL;
+	}
+sym_re_search_free:
+	if (sym_match_arr) {
+		for (i = 0; i < cnt; i++)
+			free(sym_match_arr[i]);
+		free(sym_match_arr);
+	}
 	regfree(&re);
 
 	return sym_arr;