[v3] kconfig: sort found symbols by relevance
diff mbox

Message ID 1367874931-6251-1-git-send-email-yann.morin.1998@free.fr
State New, archived
Headers show

Commit Message

Yann E. MORIN May 6, 2013, 9:15 p.m. UTC
From: "Yann E. MORIN" <yann.morin.1998@free.fr>

When searching for symbols, return the symbols sorted by relevance.

Sorting is done as thus:
  - first, symbols with a prompt,   [1]
  - then, smallest offset,          [2]
  - then, shortest match,           [3]
  - then, highest relative match,   [4]
  - finally, alphabetical sort      [5]

So, searching (eg.) for 'P.*CI' :

[1] Symbols of interest are probably those with a prompt, as they can be
    changed, while symbols with no prompt are only for info. Thus:
        PCIEASPM comes before PCI_ATS

[2] Symbols that match earlier in the name are to be preferred over
    symbols which match later. Thus:
        PCI_MSI comes before WDTPCI

[3] The shortest match is (IMHO) more interesting than a longer one.
    Thus:
        PCI comes before PCMCIA

[4] The relative match is the ratio of the length of the match against
    the length of the symbol. The more of a symbol name we match, the
    more instersting that symbol is. Thus:
        PCIEAER comes before PCIEASPM

[5] As fallback, sort symbols alphabetically

This heuristic tries hard to get interesting symbols first in the list.

In any case, exact match can (as previously) be requested by using
start-of-line and end-of-line in the searc regexp: ^PCI$

Reported-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: "Yann E. MORIN" <yann.morin.1998@free.fr>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Roland Eggner <edvx1@systemanalysen.net>
Cc: Wang YanQing <udknight@gmail.com>
---
 scripts/kconfig/symbol.c | 96 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 9 deletions(-)

Comments

wang yanqing May 9, 2013, 3:27 p.m. UTC | #1
On Mon, May 06, 2013 at 11:15:31PM +0200, Yann E. MORIN wrote:
> From: "Yann E. MORIN" <yann.morin.1998@free.fr>
> 
> When searching for symbols, return the symbols sorted by relevance.
> 
> Sorting is done as thus:
>   - first, symbols with a prompt,   [1]
>   - then, smallest offset,          [2]
>   - then, shortest match,           [3]
>   - then, highest relative match,   [4]
>   - finally, alphabetical sort      [5]
> 
> So, searching (eg.) for 'P.*CI' :
> 
> [1] Symbols of interest are probably those with a prompt, as they can be
>     changed, while symbols with no prompt are only for info. Thus:
>         PCIEASPM comes before PCI_ATS

Search the value of symbols with no prompt are useful too

> [2] Symbols that match earlier in the name are to be preferred over
>     symbols which match later. Thus:
>         PCI_MSI comes before WDTPCI

We can achieve this with ^PCI regular search

> [3] The shortest match is (IMHO) more interesting than a longer one.
>     Thus:
>         PCI comes before PCMCIA

We can achieve this with ^PCI regular search plus your previous heuristic

> [4] The relative match is the ratio of the length of the match against
>     the length of the symbol. The more of a symbol name we match, the
>     more instersting that symbol is. Thus:
>         PCIEAER comes before PCIEASPM

This is what your first heuristic

> [5] As fallback, sort symbols alphabetically
This is in your first heuristic too.

> This heuristic tries hard to get interesting symbols first in the list.
I don't mean your this heuristic is bad, but
maybe provide mechanism to user will make guesser life easier.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann E. MORIN May 9, 2013, 4:12 p.m. UTC | #2
Wang, All,

On Thu, May 09, 2013 at 11:27:31PM +0800, Wang YanQing wrote:
> On Mon, May 06, 2013 at 11:15:31PM +0200, Yann E. MORIN wrote:
> > From: "Yann E. MORIN" <yann.morin.1998@free.fr>
> > 
> > When searching for symbols, return the symbols sorted by relevance.
> > 
> > Sorting is done as thus:
> >   - first, symbols with a prompt,   [1]
> >   - then, smallest offset,          [2]
> >   - then, shortest match,           [3]
> >   - then, highest relative match,   [4]
> >   - finally, alphabetical sort      [5]
> > 
> > So, searching (eg.) for 'P.*CI' :
> > 
> > [1] Symbols of interest are probably those with a prompt, as they can be
> >     changed, while symbols with no prompt are only for info. Thus:
> >         PCIEASPM comes before PCI_ATS
> 
> Search the value of symbols with no prompt are useful too

My reasonning here is that often I need to toggle a symbol, but am not
sure what menu it's in, so I search for it.

Sometime, I also need to search for prompt-less symbols, too, to see why
a symbol I want is not visible. But that's not the most common use-case.

Also, the examples provided were with a very verbose search: 'P.*CI'
which is expected to return a hell of a lot of symbols. More precise
searches will yield substancially fewer symbols, so the search results
will probably fit in 1 or 2 pages, not 10+ like 'P.*CI'

> > [2] Symbols that match earlier in the name are to be preferred over
> >     symbols which match later. Thus:
> >         PCI_MSI comes before WDTPCI
> 
> We can achieve this with ^PCI regular search

Yes, but I expect users do enter the begining of symbols, not a string
in the middle or at the end.

Also, as Jean and Thomas have proven, not many people know that searches
can be regular expressions.

> > [3] The shortest match is (IMHO) more interesting than a longer one.
> >     Thus:
> >         PCI comes before PCMCIA
> 
> We can achieve this with ^PCI regular search plus your previous heuristic

This one is not about location of the match, it's about the length of
the match: the shortest the match, the more interesting it is (IMHO). So
we'd prefer FOO_PCI_BAR_BUZ over A_PACI_B  (when searching for 'P.*CI').

> > [4] The relative match is the ratio of the length of the match against
> >     the length of the symbol. The more of a symbol name we match, the
> >     more instersting that symbol is. Thus:
> >         PCIEAER comes before PCIEASPM
> 
> This is what your first heuristic
> 
> > [5] As fallback, sort symbols alphabetically
> This is in your first heuristic too.
> 
> > This heuristic tries hard to get interesting symbols first in the list.
> I don't mean your this heuristic is bad, but

This heuristic is just that: a heuristic. It kicks in only if the user
dit not provide an anchored regexp (ie with start- or end-of-line).

For the ten-or-so tests I did, the sorting was rather appropriate,
though that's only ten-or-so tests and is not exhaustive (and probably
subject to testing bias, too).

What I'm trying to achieve here is come up with a way to sort symbols
for searches of *non-anchored* regexps. Anchored regexps will always
provide the results they used to provide so far.

Maybe we can emphasize in the UI the fact that regexps, and especially
anchored regexps, can be used.

> maybe provide mechanism to user will make guesser life easier.

Sorry, I can't make sense of this sentence. :-( Can you elaborate a bit
what you meant, please?

Regards,
Yann E. MORIN.
wang yanqing May 10, 2013, 12:51 a.m. UTC | #3
On Thu, May 09, 2013 at 06:12:17PM +0200, Yann E. MORIN wrote:
> Wang, All,
> 
> On Thu, May 09, 2013 at 11:27:31PM +0800, Wang YanQing wrote:
> > On Mon, May 06, 2013 at 11:15:31PM +0200, Yann E. MORIN wrote:
> > > From: "Yann E. MORIN" <yann.morin.1998@free.fr>
> > > 
> > > When searching for symbols, return the symbols sorted by relevance.
> > > 
> > > Sorting is done as thus:
> > >   - first, symbols with a prompt,   [1]
> > >   - then, smallest offset,          [2]
> > >   - then, shortest match,           [3]
> > >   - then, highest relative match,   [4]
> > >   - finally, alphabetical sort      [5]
> > > 
> > > So, searching (eg.) for 'P.*CI' :
> > > 
> > > [1] Symbols of interest are probably those with a prompt, as they can be
> > >     changed, while symbols with no prompt are only for info. Thus:
> > >         PCIEASPM comes before PCI_ATS
> > 
> > Search the value of symbols with no prompt are useful too
> 
> My reasonning here is that often I need to toggle a symbol, but am not
> sure what menu it's in, so I search for it.
> 
> Sometime, I also need to search for prompt-less symbols, too, to see why
> a symbol I want is not visible. But that's not the most common use-case.
> 
> Also, the examples provided were with a very verbose search: 'P.*CI'
> which is expected to return a hell of a lot of symbols. More precise
> searches will yield substancially fewer symbols, so the search results
> will probably fit in 1 or 2 pages, not 10+ like 'P.*CI'
> 
> > > [2] Symbols that match earlier in the name are to be preferred over
> > >     symbols which match later. Thus:
> > >         PCI_MSI comes before WDTPCI
> > 
> > We can achieve this with ^PCI regular search
> 
> Yes, but I expect users do enter the begining of symbols, not a string
> in the middle or at the end.
> 
> Also, as Jean and Thomas have proven, not many people know that searches
> can be regular expressions.

Ok, indeed, I don't know regular expression in menuconfig before you
point it out :)

> > > [3] The shortest match is (IMHO) more interesting than a longer one.
> > >     Thus:
> > >         PCI comes before PCMCIA
> > 
> > We can achieve this with ^PCI regular search plus your previous heuristic
> 
> This one is not about location of the match, it's about the length of
> the match: the shortest the match, the more interesting it is (IMHO). So
> we'd prefer FOO_PCI_BAR_BUZ over A_PACI_B  (when searching for 'P.*CI').

Ok, I get your meaning.

> > > [4] The relative match is the ratio of the length of the match against
> > >     the length of the symbol. The more of a symbol name we match, the
> > >     more instersting that symbol is. Thus:
> > >         PCIEAER comes before PCIEASPM
> > 
> > This is what your first heuristic
> > 
> > > [5] As fallback, sort symbols alphabetically
> > This is in your first heuristic too.
> > 
> > > This heuristic tries hard to get interesting symbols first in the list.
> > I don't mean your this heuristic is bad, but
> 
> This heuristic is just that: a heuristic. It kicks in only if the user
> dit not provide an anchored regexp (ie with start- or end-of-line).
> 
> For the ten-or-so tests I did, the sorting was rather appropriate,
> though that's only ten-or-so tests and is not exhaustive (and probably
> subject to testing bias, too).

I hope more people will find the sorting is appropriate if Michal Marek accept 
it :)

> What I'm trying to achieve here is come up with a way to sort symbols
> for searches of *non-anchored* regexps. Anchored regexps will always
> provide the results they used to provide so far.
> 
> Maybe we can emphasize in the UI the fact that regexps, and especially
> anchored regexps, can be used.
> 
> > maybe provide mechanism to user will make guesser life easier.
> 
> Sorry, I can't make sense of this sentence. :-( Can you elaborate a bit
> what you meant, please?

The mechanism means first heuristic in previous Email patch.

I though we could get this heuristic's result by composite your first heuristic
in previous Email patch with regular expression, it seams like I lost something,
see above.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann E. MORIN May 10, 2013, 10:12 a.m. UTC | #4
Wang, All,

On Fri, May 10, 2013 at 08:51:25AM +0800, Wang YanQing wrote:
> On Thu, May 09, 2013 at 06:12:17PM +0200, Yann E. MORIN wrote:
[--SNIP--]
> > For the ten-or-so tests I did, the sorting was rather appropriate,
> > though that's only ten-or-so tests and is not exhaustive (and probably
> > subject to testing bias, too).
> 
> I hope more people will find the sorting is appropriate if Michal Marek accept 
> it :)

OK, I'll queue it in my tree, and will push it to Michal for 3.11, as it
is a bit too late for 3.10, now.

> > > maybe provide mechanism to user will make guesser life easier.
> > Sorry, I can't make sense of this sentence. :-( Can you elaborate a bit
> > what you meant, please?
> 
> The mechanism means first heuristic in previous Email patch.
> 
> I though we could get this heuristic's result by composite your first heuristic
> in previous Email patch with regular expression, it seams like I lost something,
> see above.

OK, so I'll keep it as-is (I will just fix some coding-style issues I've
spotted).

Thanks for the input!

Regards,
Yann E. MORIN.

Patch
diff mbox

diff --git a/scripts/kconfig/symbol.c b/scripts/kconfig/symbol.c
index ecc5aa5..20f8107 100644
--- a/scripts/kconfig/symbol.c
+++ b/scripts/kconfig/symbol.c
@@ -943,38 +943,116 @@  const char *sym_escape_string_value(const char *in)
 	return res;
 }
 
+struct sym_match {
+	struct symbol	*sym;
+	off_t		so, eo;
+};
+
+/* Compare matched symbols as thus:
+ * - first, symbols with a prompt,
+ * - then, smallest offset,
+ * - then, shortest match,
+ * - then, highest relative match,
+ * - finally, alphabetical sort
+ */
+static int sym_rel_comp( const void *sym1, const void *sym2 )
+{
+	struct sym_match *s1 = *(struct sym_match **)sym1;
+	struct sym_match *s2 = *(struct sym_match **)sym2;
+	struct property *p;
+	bool p1 = false, p2 = false;
+	int l1, l2, r1, r2;
+
+	for_all_prompts(s1->sym, p) {
+		p1 = true;
+	}
+	for_all_prompts(s2->sym, p) {
+		p2 = true;
+	}
+	if (p1 && !p2)
+		return -1;
+	if (!p1 && p2)
+		return 1;
+
+	if ( s1->so < s2->so )
+		return -1;
+	if ( s1->so > s2->so )
+		return 1;
+
+	l1 = s1->eo - s1->so;
+	l2 = s2->eo - s2->so;
+	if ( l1 > l2 )
+		return 1;
+	if ( l1 < l2 )
+		return -1;
+
+	r1 = 100*l1 / strlen(s1->sym->name);
+	r2 = 100*l2 / strlen(s2->sym->name);
+	if ( r1 > r2 )
+		return -1;
+	if ( r1 < r2 )
+		return 1;
+
+	return strcmp( s1->sym->name, s2->sym->name );
+}
+
 struct symbol **sym_re_search(const char *pattern)
 {
 	struct symbol *sym, **sym_arr = NULL;
+	struct sym_match **sym_match_arr = NULL;
 	int i, cnt, size;
 	regex_t re;
+	regmatch_t match[1];
 
 	cnt = size = 0;
 	/* Skip if empty */
 	if (strlen(pattern) == 0)
 		return NULL;
-	if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB|REG_ICASE))
+	if (regcomp(&re, pattern, REG_EXTENDED|REG_ICASE))
 		return NULL;
 
 	for_all_symbols(i, sym) {
+		struct sym_match *tmp_sym_match;
 		if (sym->flags & SYMBOL_CONST || !sym->name)
 			continue;
-		if (regexec(&re, sym->name, 0, NULL, 0))
+		if (regexec(&re, sym->name, 1, match, 0))
 			continue;
 		if (cnt + 1 >= size) {
-			void *tmp = sym_arr;
+			void *tmp;
 			size += 16;
-			sym_arr = realloc(sym_arr, size * sizeof(struct symbol *));
-			if (!sym_arr) {
-				free(tmp);
-				return NULL;
+			tmp = realloc(sym_match_arr, size * sizeof(struct sym_match *));
+			if (!tmp) {
+				goto sym_re_search_free;
 			}
+			sym_match_arr = tmp;
 		}
 		sym_calc_value(sym);
-		sym_arr[cnt++] = sym;
+		tmp_sym_match = (struct sym_match*)malloc(sizeof(struct sym_match));
+		if (!tmp_sym_match)
+			goto sym_re_search_free;
+		tmp_sym_match->sym = sym;
+		/* As regexec return 0, we know we have a match, so
+		 * we can use match[0].rm_[se]o without further checks
+		 */
+		tmp_sym_match->so = match[0].rm_so;
+		tmp_sym_match->eo = match[0].rm_eo;
+		sym_match_arr[cnt++] = tmp_sym_match;
 	}
-	if (sym_arr)
+	if( sym_match_arr ) {
+		qsort( sym_match_arr, cnt, sizeof(struct sym_match*), sym_rel_comp );
+		sym_arr = malloc( (cnt+1) * sizeof(struct symbol) );
+		if (!sym_arr)
+			goto sym_re_search_free;
+		for ( i=0; i<cnt; i++ )
+			sym_arr[i] = sym_match_arr[i]->sym;
 		sym_arr[cnt] = NULL;
+	}
+sym_re_search_free:
+	if (sym_match_arr) {
+		for ( i=0; i<cnt; i++ )
+			free( sym_match_arr[i] );
+		free( sym_match_arr );
+	}
 	regfree(&re);
 
 	return sym_arr;