diff mbox

[v6,08/15] qdist: add module to represent frequency distributions of data

Message ID 20160608180645.GA14106@flamenco (mailing list archive)
State New, archived
Headers show

Commit Message

Emilio Cota June 8, 2016, 6:06 p.m. UTC
On Wed, Jun 08, 2016 at 17:10:03 +0300, Sergey Fedorov wrote:
> On 08/06/16 03:02, Emilio G. Cota wrote:
> > -    dist->entries = g_realloc(dist->entries,
> > -                              sizeof(*dist->entries) * (dist->n + 1));
> > +    if (unlikely(dist->n == dist->size)) {
> > +        dist->size = dist->size ? dist->size * 2 : 1;
> 
> We could initialize 'dist->size' to 1 and allocate a 1-entry
> 'dist->entries' array in qdist_init() to avoid this ternary operation ;-)

Done. This resulted in quite a few modifications, since dist->entries == NULL
had been used as an equivalent to dist->n == 0.

> >>>> (snip)
> >> So our scale is not linear. I think some users might get confused by this.
> > That's correct. I think special-casing 0 makes sense though, since
> > it increases the signal-to-noise ratio of the histogram. For example:
> >
> > 1) 0 as ' ':
> > TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
> > TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3
> >
> > 2) 0 as '1/8':
> > TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
> > TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3
> >
> > I think in these examples most users would be less confused by 1) than by 2).
> 
> I was meaning to represent all bars whose value < 1/8 as a space, not
> only whose value is pure zero. Otherwise we can see 1/8 bar where the
> actual value is negligibly differ from zero as in the second example.

I see. That would be (3):

TB hash occupancy   32.79% avg chain occ. Histogram: [0,10)%|▅ █  ▅ ▂  |[90,100]%
TB hash avg chain   1.017 buckets. Histogram: 1|█  |3

I still think (1) is the representation that gives the most information.
IMO it's valuable that "close to zero" and "zero" are represented differently,
in the same way that max and "close to max" are represented differently
as well (only max gets 8/8).

BTW, while looking into this I fixed a bug; sometimes we'd print 7/8 instead
of 8/8 for the max value, due to the ordering of FP computations
[see 1-3 in (2) above; it's 7/8 instead of 8/8]. Fixed with:


Thanks,

		Emilio

Comments

Sergey Fedorov June 8, 2016, 6:18 p.m. UTC | #1
On 08/06/16 21:06, Emilio G. Cota wrote:
> On Wed, Jun 08, 2016 at 17:10:03 +0300, Sergey Fedorov wrote:
>> On 08/06/16 03:02, Emilio G. Cota wrote:
>>> -    dist->entries = g_realloc(dist->entries,
>>> -                              sizeof(*dist->entries) * (dist->n + 1));
>>> +    if (unlikely(dist->n == dist->size)) {
>>> +        dist->size = dist->size ? dist->size * 2 : 1;
>> We could initialize 'dist->size' to 1 and allocate a 1-entry
>> 'dist->entries' array in qdist_init() to avoid this ternary operation ;-)
> Done. This resulted in quite a few modifications, since dist->entries == NULL
> had been used as an equivalent to dist->n == 0.
>
>>>>>> (snip)
>>>> So our scale is not linear. I think some users might get confused by this.
>>> That's correct. I think special-casing 0 makes sense though, since
>>> it increases the signal-to-noise ratio of the histogram. For example:
>>>
>>> 1) 0 as ' ':
>>> TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
>>> TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3
>>>
>>> 2) 0 as '1/8':
>>> TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
>>> TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3
>>>
>>> I think in these examples most users would be less confused by 1) than by 2).
>> I was meaning to represent all bars whose value < 1/8 as a space, not
>> only whose value is pure zero. Otherwise we can see 1/8 bar where the
>> actual value is negligibly differ from zero as in the second example.
> I see. That would be (3):
>
> TB hash occupancy   32.79% avg chain occ. Histogram: [0,10)%|▅ █  ▅ ▂  |[90,100]%
> TB hash avg chain   1.017 buckets. Histogram: 1|█  |3
>
> I still think (1) is the representation that gives the most information.
> IMO it's valuable that "close to zero" and "zero" are represented differently,
> in the same way that max and "close to max" are represented differently
> as well (only max gets 8/8).

Anyhow, these histograms are only for rough estimation and to see some
nice unicode output :)

>
> BTW, while looking into this I fixed a bug; sometimes we'd print 7/8 instead
> of 8/8 for the max value, due to the ordering of FP computations
> [see 1-3 in (2) above; it's 7/8 instead of 8/8]. Fixed with:
>
> diff --git a/util/qdist.c b/util/qdist.c
> index cfe09e6..7842d34 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -103,7 +103,7 @@ static const gunichar qdist_blocks[] = {
>   */
>  static char *qdist_pr_internal(const struct qdist *dist)
>  {
> -    double min, max, step;
> +    double min, max;
>      GString *s = g_string_new("");
>      size_t i;
>  
> @@ -131,16 +131,14 @@ static char *qdist_pr_internal(const struct qdist *dist)
>          }
>      }
>  
> -    /* floor((count - min) * step) will give us the block index */
> -    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
> -
>      for (i = 0; i < dist->n; i++) {
>          struct qdist_entry *e = &dist->entries[i];
>          int index;
>  
>          /* make an exception with 0; instead of using block[0], print a space */
>          if (e->count) {
> -            index = (int)((e->count - min) * step);
> +            /* divide first to avoid loss of precision when e->count == max */
> +            index = (e->count - min) / (max - min) * (QDIST_NR_BLOCK_CODES - 1);
>              g_string_append_unichar(s, qdist_blocks[index]);
>          } else {
>              g_string_append_c(s, ' ');
>
> I also added a test to test-qdist (called "test_bin_precision") that
> checks for this.
>
>>> The behaviour isn't the same though. With this we have
>>> that the two outer bins (leftmost and rightmost) are unnecessarily
>>> large (since they're out of the range of the input data).
>>>
>>> For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
>>> it makes no sense to report the first bin as [-12.5,12.5). If we
>>> then truncate the unnecessary edges, we'd have [0,12.5), but
>>> then the second bin is [12.5,37.5). Bins of unequal size are
>>> possible (although a bit unusual) in histograms, but given
>>> our Unicode-based representation, we're limited to same-width bars.
>> That is why I noted that I'm not sure what is the most correct from
>> mathematical point of view. Maybe consider the second option? I.e.
>> rounding to the middle of each bin with:
>>
>>     x = left + step / 2;
>>
>> which would give the picture like this:
>>
>>
>>     xmin [*---*---*---*---*] xmax   -- from
>>           |   |   |   |   |
>>            \ / \ / \ / \ /
>>             |   |   |   |
>>             V   V   V   V
>>            [*   *   *   *]          -- to
> This binning is equivalent to what we do right now.
>
> The only difference is where the value is set (either at the left
> of the bin, or at the center as above); this, however, isn't too
> important, since this value is only used when printing
> the labels, i.e. we could print [left, left+step) or
> [center-step/2, center+step/2) and still get the same results.
>
>> Anyway, you may consider if you like whether it's possible to apply some
>> simplifications from my code to the final version.
> OK. This is how it looks like:
>
> diff --git a/util/qdist.c b/util/qdist.c
> index 7842d34..3ca2227 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -163,7 +163,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
>  {
>      double xmin, xmax;
>      double step;
> -    size_t i, j, j_min;
> +    size_t i, j;
>  
>      qdist_init(to);
>  
> @@ -194,7 +194,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
>      }
>  
>   rebin:
> -    j_min = 0;
> +    j = 0;
>      for (i = 0; i < n; i++) {
>          double x;
>          double left, right;
> @@ -210,19 +210,13 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
>           * To avoid double-counting we capture [left, right) ranges, except for
>           * the righmost bin, which captures a [left, right] range.
>           */
> -        for (j = j_min; j < from->n; j++) {
> +        while (j < from->n &&
> +               (from->entries[j].x < right ||
> +                (i == n - 1 && from->entries[j].x == right))) {

If "i == n - 1" then we have to put everything what's left in 'from',
right? If so, then:

while (j < from->n && (from->entries[j].x < right || i == n - 1)) {

Kind regards,
Sergey

>              struct qdist_entry *o = &from->entries[j];
>  
> -            /* entries are ordered so do not check beyond right */
> -            if (o->x > right) {
> -                break;
> -            }
> -            if (o->x >= left && (o->x < right ||
> -                                   (i == n - 1 && o->x == right))) {
> -                qdist_add(to, x, o->count);
> -                /* don't check this entry again */
> -                j_min = j + 1;
> -            }
> +            qdist_add(to, x, o->count);
> +            j++;
>          }
>      }
>  }
>
> Thanks,
>
> 		Emilio
diff mbox

Patch

diff --git a/util/qdist.c b/util/qdist.c
index cfe09e6..7842d34 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -103,7 +103,7 @@  static const gunichar qdist_blocks[] = {
  */
 static char *qdist_pr_internal(const struct qdist *dist)
 {
-    double min, max, step;
+    double min, max;
     GString *s = g_string_new("");
     size_t i;
 
@@ -131,16 +131,14 @@  static char *qdist_pr_internal(const struct qdist *dist)
         }
     }
 
-    /* floor((count - min) * step) will give us the block index */
-    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
-
     for (i = 0; i < dist->n; i++) {
         struct qdist_entry *e = &dist->entries[i];
         int index;
 
         /* make an exception with 0; instead of using block[0], print a space */
         if (e->count) {
-            index = (int)((e->count - min) * step);
+            /* divide first to avoid loss of precision when e->count == max */
+            index = (e->count - min) / (max - min) * (QDIST_NR_BLOCK_CODES - 1);
             g_string_append_unichar(s, qdist_blocks[index]);
         } else {
             g_string_append_c(s, ' ');

I also added a test to test-qdist (called "test_bin_precision") that
checks for this.

> > The behaviour isn't the same though. With this we have
> > that the two outer bins (leftmost and rightmost) are unnecessarily
> > large (since they're out of the range of the input data).
> >
> > For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> > it makes no sense to report the first bin as [-12.5,12.5). If we
> > then truncate the unnecessary edges, we'd have [0,12.5), but
> > then the second bin is [12.5,37.5). Bins of unequal size are
> > possible (although a bit unusual) in histograms, but given
> > our Unicode-based representation, we're limited to same-width bars.
> 
> That is why I noted that I'm not sure what is the most correct from
> mathematical point of view. Maybe consider the second option? I.e.
> rounding to the middle of each bin with:
> 
>     x = left + step / 2;
> 
> which would give the picture like this:
> 
> 
>     xmin [*---*---*---*---*] xmax   -- from
>           |   |   |   |   |
>            \ / \ / \ / \ /
>             |   |   |   |
>             V   V   V   V
>            [*   *   *   *]          -- to

This binning is equivalent to what we do right now.

The only difference is where the value is set (either at the left
of the bin, or at the center as above); this, however, isn't too
important, since this value is only used when printing
the labels, i.e. we could print [left, left+step) or
[center-step/2, center+step/2) and still get the same results.

> Anyway, you may consider if you like whether it's possible to apply some
> simplifications from my code to the final version.

OK. This is how it looks like:

diff --git a/util/qdist.c b/util/qdist.c
index 7842d34..3ca2227 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -163,7 +163,7 @@  void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
 {
     double xmin, xmax;
     double step;
-    size_t i, j, j_min;
+    size_t i, j;
 
     qdist_init(to);
 
@@ -194,7 +194,7 @@  void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
     }
 
  rebin:
-    j_min = 0;
+    j = 0;
     for (i = 0; i < n; i++) {
         double x;
         double left, right;
@@ -210,19 +210,13 @@  void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
          * To avoid double-counting we capture [left, right) ranges, except for
          * the righmost bin, which captures a [left, right] range.
          */
-        for (j = j_min; j < from->n; j++) {
+        while (j < from->n &&
+               (from->entries[j].x < right ||
+                (i == n - 1 && from->entries[j].x == right))) {
             struct qdist_entry *o = &from->entries[j];
 
-            /* entries are ordered so do not check beyond right */
-            if (o->x > right) {
-                break;
-            }
-            if (o->x >= left && (o->x < right ||
-                                   (i == n - 1 && o->x == right))) {
-                qdist_add(to, x, o->count);
-                /* don't check this entry again */
-                j_min = j + 1;
-            }
+            qdist_add(to, x, o->count);
+            j++;
         }
     }
 }