diff mbox series

[6/6] selinux: improve symtab string hashing

Message ID 20230818151220.166215-5-cgzones@googlemail.com (mailing list archive)
State Changes Requested
Delegated to: Paul Moore
Headers show
Series [1/6] selinux: print sum of chain lengths^2 for hash tables | expand

Commit Message

Christian Göttsche Aug. 18, 2023, 3:12 p.m. UTC
The number of buckets is calculated by performing a binary AND against
the mask of the hash table, which is one less than its size (which is a
power of two).  This leads to all top bits being discarded, requiring
for short or similar inputs a hash function with a good avalanche
effect.

Use djb2a:

    # current
    common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
    classes:  134 entries and 100/256 buckets used, longest chain length 5, sum of chain length^2 234
    roles:  15 entries and 6/16 buckets used, longest chain length 5, sum of chain length^2 57
    types:  4448 entries and 3016/8192 buckets used, longest chain length 41, sum of chain length^2 14922
    users:  7 entries and 3/8 buckets used, longest chain length 3, sum of chain length^2 17
    bools:  306 entries and 221/512 buckets used, longest chain length 4, sum of chain length^2 524
    levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
    categories:  1024 entries and 400/1024 buckets used, longest chain length 4, sum of chain length^2 2740

    # patch
    common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
    classes:  134 entries and 101/256 buckets used, longest chain length 3, sum of chain length^2 210
    roles:  15 entries and 9/16 buckets used, longest chain length 3, sum of chain length^2 31
    types:  4448 entries and 3459/8192 buckets used, longest chain length 5, sum of chain length^2 6778
    users:  7 entries and 5/8 buckets used, longest chain length 3, sum of chain length^2 13
    bools:  306 entries and 236/512 buckets used, longest chain length 5, sum of chain length^2 470
    levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
    categories:  1024 entries and 518/1024 buckets used, longest chain length 7, sum of chain length^2 2992

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
---
 security/selinux/ss/symtab.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

Comments

Stephen Smalley Sept. 7, 2023, 5:44 p.m. UTC | #1
On Fri, Aug 18, 2023 at 11:12 AM Christian Göttsche
<cgzones@googlemail.com> wrote:
>
> The number of buckets is calculated by performing a binary AND against
> the mask of the hash table, which is one less than its size (which is a
> power of two).  This leads to all top bits being discarded, requiring
> for short or similar inputs a hash function with a good avalanche
> effect.
>
> Use djb2a:
>
>     # current
>     common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
>     classes:  134 entries and 100/256 buckets used, longest chain length 5, sum of chain length^2 234
>     roles:  15 entries and 6/16 buckets used, longest chain length 5, sum of chain length^2 57
>     types:  4448 entries and 3016/8192 buckets used, longest chain length 41, sum of chain length^2 14922
>     users:  7 entries and 3/8 buckets used, longest chain length 3, sum of chain length^2 17
>     bools:  306 entries and 221/512 buckets used, longest chain length 4, sum of chain length^2 524
>     levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
>     categories:  1024 entries and 400/1024 buckets used, longest chain length 4, sum of chain length^2 2740
>
>     # patch
>     common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
>     classes:  134 entries and 101/256 buckets used, longest chain length 3, sum of chain length^2 210
>     roles:  15 entries and 9/16 buckets used, longest chain length 3, sum of chain length^2 31
>     types:  4448 entries and 3459/8192 buckets used, longest chain length 5, sum of chain length^2 6778
>     users:  7 entries and 5/8 buckets used, longest chain length 3, sum of chain length^2 13
>     bools:  306 entries and 236/512 buckets used, longest chain length 5, sum of chain length^2 470
>     levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
>     categories:  1024 entries and 518/1024 buckets used, longest chain length 7, sum of chain length^2 2992
>
> Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
> ---
>  security/selinux/ss/symtab.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/security/selinux/ss/symtab.c b/security/selinux/ss/symtab.c
> index 43d7f0319ccd..b6761b96eee4 100644
> --- a/security/selinux/ss/symtab.c
> +++ b/security/selinux/ss/symtab.c
> @@ -11,16 +11,14 @@
>
>  static unsigned int symhash(const void *key)
>  {
> -       const char *p, *keyp;
> -       unsigned int size;
> -       unsigned int val;
> -
> -       val = 0;
> -       keyp = key;
> -       size = strlen(keyp);
> -       for (p = keyp; (p - keyp) < size; p++)
> -               val = (val << 4 | (val >> (8*sizeof(unsigned int)-4))) ^ (*p);
> -       return val;
> +       /* djb2a */

Do we need/want something that specifies the author/license (I assume
public domain) of this code?

> +       unsigned int hash = 5381;
> +       unsigned char c;
> +
> +       while ((c = *(const unsigned char *)key++))
> +               hash = ((hash << 5) + hash) ^ c;
> +
> +       return hash;
>  }
>
>  static int symcmp(const void *key1, const void *key2)
> --
> 2.40.1
>
Stephen Smalley Sept. 8, 2023, 6:49 p.m. UTC | #2
On Fri, Aug 18, 2023 at 11:12 AM Christian Göttsche
<cgzones@googlemail.com> wrote:
>
> The number of buckets is calculated by performing a binary AND against
> the mask of the hash table, which is one less than its size (which is a
> power of two).  This leads to all top bits being discarded, requiring
> for short or similar inputs a hash function with a good avalanche
> effect.
>
> Use djb2a:
>
>     # current
>     common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
>     classes:  134 entries and 100/256 buckets used, longest chain length 5, sum of chain length^2 234
>     roles:  15 entries and 6/16 buckets used, longest chain length 5, sum of chain length^2 57
>     types:  4448 entries and 3016/8192 buckets used, longest chain length 41, sum of chain length^2 14922
>     users:  7 entries and 3/8 buckets used, longest chain length 3, sum of chain length^2 17
>     bools:  306 entries and 221/512 buckets used, longest chain length 4, sum of chain length^2 524
>     levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
>     categories:  1024 entries and 400/1024 buckets used, longest chain length 4, sum of chain length^2 2740
>
>     # patch
>     common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
>     classes:  134 entries and 101/256 buckets used, longest chain length 3, sum of chain length^2 210
>     roles:  15 entries and 9/16 buckets used, longest chain length 3, sum of chain length^2 31
>     types:  4448 entries and 3459/8192 buckets used, longest chain length 5, sum of chain length^2 6778
>     users:  7 entries and 5/8 buckets used, longest chain length 3, sum of chain length^2 13
>     bools:  306 entries and 236/512 buckets used, longest chain length 5, sum of chain length^2 470
>     levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
>     categories:  1024 entries and 518/1024 buckets used, longest chain length 7, sum of chain length^2 2992
>
> Signed-off-by: Christian Göttsche <cgzones@googlemail.com>

Assuming this is in fact public domain,
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Paul Moore Sept. 13, 2023, 5:46 p.m. UTC | #3
On Thu, Sep 7, 2023 at 1:44 PM Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
> On Fri, Aug 18, 2023 at 11:12 AM Christian Göttsche
> <cgzones@googlemail.com> wrote:
> >
> > The number of buckets is calculated by performing a binary AND against
> > the mask of the hash table, which is one less than its size (which is a
> > power of two).  This leads to all top bits being discarded, requiring
> > for short or similar inputs a hash function with a good avalanche
> > effect.
> >
> > Use djb2a:
> >
> >     # current
> >     common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
> >     classes:  134 entries and 100/256 buckets used, longest chain length 5, sum of chain length^2 234
> >     roles:  15 entries and 6/16 buckets used, longest chain length 5, sum of chain length^2 57
> >     types:  4448 entries and 3016/8192 buckets used, longest chain length 41, sum of chain length^2 14922
> >     users:  7 entries and 3/8 buckets used, longest chain length 3, sum of chain length^2 17
> >     bools:  306 entries and 221/512 buckets used, longest chain length 4, sum of chain length^2 524
> >     levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
> >     categories:  1024 entries and 400/1024 buckets used, longest chain length 4, sum of chain length^2 2740
> >
> >     # patch
> >     common prefixes:  7 entries and 5/8 buckets used, longest chain length 2, sum of chain length^2 11
> >     classes:  134 entries and 101/256 buckets used, longest chain length 3, sum of chain length^2 210
> >     roles:  15 entries and 9/16 buckets used, longest chain length 3, sum of chain length^2 31
> >     types:  4448 entries and 3459/8192 buckets used, longest chain length 5, sum of chain length^2 6778
> >     users:  7 entries and 5/8 buckets used, longest chain length 3, sum of chain length^2 13
> >     bools:  306 entries and 236/512 buckets used, longest chain length 5, sum of chain length^2 470
> >     levels:  1 entries and 1/1 buckets used, longest chain length 1, sum of chain length^2 1
> >     categories:  1024 entries and 518/1024 buckets used, longest chain length 7, sum of chain length^2 2992
> >
> > Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
> > ---
> >  security/selinux/ss/symtab.c | 18 ++++++++----------
> >  1 file changed, 8 insertions(+), 10 deletions(-)
> >
> > diff --git a/security/selinux/ss/symtab.c b/security/selinux/ss/symtab.c
> > index 43d7f0319ccd..b6761b96eee4 100644
> > --- a/security/selinux/ss/symtab.c
> > +++ b/security/selinux/ss/symtab.c
> > @@ -11,16 +11,14 @@
> >
> >  static unsigned int symhash(const void *key)
> >  {
> > -       const char *p, *keyp;
> > -       unsigned int size;
> > -       unsigned int val;
> > -
> > -       val = 0;
> > -       keyp = key;
> > -       size = strlen(keyp);
> > -       for (p = keyp; (p - keyp) < size; p++)
> > -               val = (val << 4 | (val >> (8*sizeof(unsigned int)-4))) ^ (*p);
> > -       return val;
> > +       /* djb2a */
>
> Do we need/want something that specifies the author/license (I assume
> public domain) of this code?

It would be a good idea, yes.  I spent some time looking around, and
while there are many references to djb2 (and the djb2a xor version), I
had a hard time finding an explicit license statement from DJB.
However, in the cdb v0.75 release there is a source file, cdb_hash.c,
which has effectively the same hash (same algorithm, different
implementation) with a "/* Public domain. */" statement at the top.
IANAL, but I think that is good enough.

Christian, can you update this patch comments with this info?

http://cr.yp.to/cdb.html

> > +       unsigned int hash = 5381;
> > +       unsigned char c;
> > +
> > +       while ((c = *(const unsigned char *)key++))
> > +               hash = ((hash << 5) + hash) ^ c;
> > +
> > +       return hash;
> >  }
> >
> >  static int symcmp(const void *key1, const void *key2)
> > --
> > 2.40.1
diff mbox series

Patch

diff --git a/security/selinux/ss/symtab.c b/security/selinux/ss/symtab.c
index 43d7f0319ccd..b6761b96eee4 100644
--- a/security/selinux/ss/symtab.c
+++ b/security/selinux/ss/symtab.c
@@ -11,16 +11,14 @@ 
 
 static unsigned int symhash(const void *key)
 {
-	const char *p, *keyp;
-	unsigned int size;
-	unsigned int val;
-
-	val = 0;
-	keyp = key;
-	size = strlen(keyp);
-	for (p = keyp; (p - keyp) < size; p++)
-		val = (val << 4 | (val >> (8*sizeof(unsigned int)-4))) ^ (*p);
-	return val;
+	/* djb2a */
+	unsigned int hash = 5381;
+	unsigned char c;
+
+	while ((c = *(const unsigned char *)key++))
+		hash = ((hash << 5) + hash) ^ c;
+
+	return hash;
 }
 
 static int symcmp(const void *key1, const void *key2)