From patchwork Wed Jun 8 18:06:45 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Emilio Cota
X-Patchwork-Id: 9165413
Date: Wed, 8 Jun 2016 14:06:45 -0400
From: "Emilio G. 
Cota"
To: Sergey Fedorov
Message-ID: <20160608180645.GA14106@flamenco>
References: <1464138802-23503-1-git-send-email-cota@braap.org>
 <1464138802-23503-9-git-send-email-cota@braap.org>
 <5749E02A.3080909@gmail.com>
 <20160607010545.GB4418@flamenco>
 <5756EEC0.8090502@gmail.com>
 <20160608000224.GB16255@flamenco>
 <5758273B.7090603@gmail.com>
In-Reply-To: <5758273B.7090603@gmail.com>
Subject: Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent
 frequency distributions of data
Cc: MTTCG Devel, Paolo Bonzini, Alex Bennée, QEMU Developers,
 Richard Henderson

On Wed, Jun 08, 2016 at 17:10:03 +0300, Sergey Fedorov wrote:
> On 08/06/16 03:02, Emilio G. Cota wrote:
> > -    dist->entries = g_realloc(dist->entries,
> > -                              sizeof(*dist->entries) * (dist->n + 1));
> > +    if (unlikely(dist->n == dist->size)) {
> > +        dist->size = dist->size ? dist->size * 2 : 1;
>
> We could initialize 'dist->size' to 1 and allocate a 1-entry
> 'dist->entries' array in qdist_init() to avoid this ternary operation ;-)

Done. This resulted in quite a few modifications, since
dist->entries == NULL had been used as an equivalent to dist->n == 0.

> >>>> (snip)
> >> So our scale is not linear. I think some users might get confused by this.
> > That's correct. I think special-casing 0 makes sense though, since
> > it increases the signal-to-noise ratio of the histogram. For example:
> >
> > 1) 0 as ' ':
> > TB hash occupancy 31.84% avg chain occ. 
Histogram: [0,10)%|▆ █ ▅▁▃▁▁|[90,100]%
> > TB hash avg chain 1.015 buckets. Histogram: 1|█▁▁|3
> >
> > 2) 0 as '1/8':
> > TB hash occupancy 32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
> > TB hash avg chain 1.015 buckets. Histogram: 1|▇▁▁|3
> >
> > I think in these examples most users would be less confused by 1) than by 2).
>
> I meant to represent all bars whose value is < 1/8 as a space, not
> only those whose value is exactly zero. Otherwise we can see a 1/8 bar where
> the actual value differs negligibly from zero, as in the second example.

I see. That would be (3):

TB hash occupancy 32.79% avg chain occ. Histogram: [0,10)%|▅ █ ▅ ▂ |[90,100]%
TB hash avg chain 1.017 buckets. Histogram: 1|█ |3

I still think (1) is the representation that gives the most information.
IMO it's valuable that "close to zero" and "zero" are represented
differently, in the same way that max and "close to max" are represented
differently as well (only max gets 8/8).

BTW, while looking into this I fixed a bug; sometimes we'd print 7/8 instead
of 8/8 for the max value, due to the ordering of FP computations
[see 1-3 in (2) above; it's 7/8 instead of 8/8].
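
A minimal sketch of the two orderings, not the actual qdist code. It assumes
eight block glyphs (indices 0..7), and the function names are made up for
illustration. With the divide-first form, count == max computes
(max - min) / (max - min), which is exactly 1.0, so the index is exactly 7.
The multiply-by-step form is the one reported above to yield 7/8 instead of
8/8 for some inputs.

```c
#include <assert.h>

/* Assumption: eight block glyphs, so valid indices are 0..7. */
#define NR_BLOCK_CODES 8

/* Old ordering (hypothetical name): precompute step, then multiply. */
int block_index_step_first(double count, double min, double max)
{
    double step;

    assert(max > min);
    step = (NR_BLOCK_CODES - 1) / (max - min);
    /* the product may round to just below an integer before truncation */
    return (int)((count - min) * step);
}

/* New ordering (hypothetical name): divide first, then scale. */
int block_index_divide_first(double count, double min, double max)
{
    assert(max > min);
    /* for count == max the quotient is exactly 1.0, so the result is 7 */
    return (int)((count - min) / (max - min) * (NR_BLOCK_CODES - 1));
}
```

Calling both functions with count == max over a range of min/max pairs is one
way to look for inputs where the two orderings disagree.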
Fixed with:

Thanks,

		Emilio

diff --git a/util/qdist.c b/util/qdist.c
index cfe09e6..7842d34 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -103,7 +103,7 @@ static const gunichar qdist_blocks[] = {
  */
 static char *qdist_pr_internal(const struct qdist *dist)
 {
-    double min, max, step;
+    double min, max;
     GString *s = g_string_new("");
     size_t i;
 
@@ -131,16 +131,14 @@ static char *qdist_pr_internal(const struct qdist *dist)
         }
     }
 
-    /* floor((count - min) * step) will give us the block index */
-    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
-
     for (i = 0; i < dist->n; i++) {
         struct qdist_entry *e = &dist->entries[i];
         int index;
 
         /* make an exception with 0; instead of using block[0], print a space */
         if (e->count) {
-            index = (int)((e->count - min) * step);
+            /* divide first to avoid loss of precision when e->count == max */
+            index = (e->count - min) / (max - min) * (QDIST_NR_BLOCK_CODES - 1);
             g_string_append_unichar(s, qdist_blocks[index]);
         } else {
             g_string_append_c(s, ' ');

I also added a test to test-qdist (called "test_bin_precision") that checks
for this.

> > The behaviour isn't the same though. With this we have
> > that the two outer bins (leftmost and rightmost) are unnecessarily
> > large (since they're out of the range of the input data).
> >
> > For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> > it makes no sense to report the first bin as [-12.5,12.5). If we
> > then truncate the unnecessary edges, we'd have [0,12.5), but
> > then the second bin is [12.5,37.5). Bins of unequal size are
> > possible (although a bit unusual) in histograms, but given
> > our Unicode-based representation, we're limited to same-width bars.
>
> That is why I noted that I'm not sure what is the most correct from
> mathematical point of view. Maybe consider the second option? I.e.
> rounding to the middle of each bin with:
>
>     x = left + step / 2;
>
> which would give the picture like this:
>
>
> xmin [*---*---*---*---*] xmax  -- from
>       |   |   |   |   |
>        \ / \ / \ / \ /
>         |   |   |   |
>         V   V   V   V
>        [*   *   *   *]         -- to

This binning is equivalent to what we do right now. The only difference
is where the value is set (either at the left of the bin, or at the center
as above); this, however, isn't too important, since this value is only
used when printing the labels, i.e. we could print [left, left+step) or
[center-step/2, center+step/2) and still get the same results.

> Anyway, you may consider if you like whether it's possible to apply some
> simplifications from my code to the final version.

OK. This is what it looks like:

diff --git a/util/qdist.c b/util/qdist.c
index 7842d34..3ca2227 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -163,7 +163,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
 {
     double xmin, xmax;
     double step;
-    size_t i, j, j_min;
+    size_t i, j;
 
     qdist_init(to);
 
@@ -194,7 +194,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
     }
 
  rebin:
-    j_min = 0;
+    j = 0;
     for (i = 0; i < n; i++) {
         double x;
         double left, right;
@@ -210,19 +210,13 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
          * To avoid double-counting we capture [left, right) ranges, except for
          * the righmost bin, which captures a [left, right] range.
          */
-        for (j = j_min; j < from->n; j++) {
+        while (j < from->n &&
+               (from->entries[j].x < right ||
+                (i == n - 1 && from->entries[j].x == right))) {
             struct qdist_entry *o = &from->entries[j];
 
-            /* entries are ordered so do not check beyond right */
-            if (o->x > right) {
-                break;
-            }
-            if (o->x >= left && (o->x < right ||
-                                 (i == n - 1 && o->x == right))) {
-                qdist_add(to, x, o->count);
-                /* don't check this entry again */
-                j_min = j + 1;
-            }
+            qdist_add(to, x, o->count);
+            j++;
         }
     }
 }
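
A standalone sketch of the single-pass rebinning idea in the diff above, for
illustration only. It assumes the input entries are sorted by ascending x and
all fall within [xmin, xmax]. The names struct entry and bin_counts are
hypothetical and not part of qdist. Unlike the diff, the sketch pins the right
edge of the last bin to xmax rather than deriving it from step; that is a
simplification of mine, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct qdist_entry. */
struct entry {
    double x;
    unsigned long count;
};

/*
 * Sum the counts of 'from' (sorted by ascending x) into n equal-width bins
 * covering [xmin, xmax]. Each bin captures [left, right), except the
 * rightmost one, which captures [left, right].
 */
void bin_counts(const struct entry *from, size_t from_n,
                double xmin, double xmax,
                unsigned long *bins, size_t n)
{
    double step;
    size_t i, j = 0;

    assert(n > 0 && xmax > xmin);
    step = (xmax - xmin) / n;

    for (i = 0; i < n; i++) {
        /* assumption: use xmax itself as the last right edge */
        double right = (i == n - 1) ? xmax : xmin + (i + 1) * step;

        bins[i] = 0;
        /* j only moves forward: every entry is visited exactly once */
        while (j < from_n &&
               (from[j].x < right ||
                (i == n - 1 && from[j].x == right))) {
            bins[i] += from[j].count;
            j++;
        }
    }
}
```

Because the entries are sorted, the index j never has to be reset per bin, so
no separate j_min bookkeeping is needed.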