Message ID: 571A82B8.5080908@twiddle.net
On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> FWIW, so that I could get an idea of how the stats change as we improve the
> hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> attachment 2 attempting to fix the accounting for patches 9 and 10.

For qht, I dislike the approach of reporting "avg chain" per-element,
instead of per-bucket. Performance for a bucket whose entries are
all valid is virtually the same as that of a bucket that only
has one valid element; thus, with per-bucket reporting, we'd say that
the chain length is 1 in both cases, i.e. "perfect". With per-element
reporting, we'd report 4 (on a 64-bit host, since that's the value of
QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
wrong idea (users would think they're in trouble, when they're not).

Using the avg-bucket-chain metric you can test how good the hashing is.
For instance, the metric is 1.01 for xxhash with phys_pc, pc and flags
(i.e. func5), and 1.21 if func5 takes only a valid phys_pc (the other
two are 0).

I think reporting fully empty buckets, as well as the longest chain
(of buckets, for qht), in addition to this metric is a good idea, though.

> For booting an alpha kernel to login prompt:
>
> Before hashing changes (@5/11)
>
> TB count             175363/671088
> TB invalidate count  3996
> TB hash buckets      31731/32768
> TB hash avg chain    5.289 max=59
>
> After xxhash patch (@7/11)
>
> TB hash buckets      32582/32768
> TB hash avg chain    5.260 max=18
>
> So far so good!
>
> After qht patches (@11/11)
>
> TB hash buckets      94360/131072
> TB hash avg chain    1.774 max=8
>
> Do note that those last numbers are off: 1.774 avg * 94360 used buckets =
> 167394 total entries, which is far from 171367, the correct number of total
> entries.

If those numbers are off, then either this
  assert(hinfo.used_entries ==
         tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
should trigger, or the accounting isn't right.

Another option is that "TB count - invalidate_count" is different for
each test you ran. I think this is what's going on, otherwise we
couldn't explain why the first report ("before 5/11") is also "wrong":
  5.289 * 31731 = 167825.259

Only the second report ("after 7/11") seems good (taking into account
the limited precision of just 3 decimals):
  5.26 * 32582 = 171381.32 ~= 171367
which leads me to believe that you've used the TB and invalidate
counts from that test.

I just tested your patches (on an ARM bootup) and the assert doesn't
trigger, and the stats are spot on for "after 11/11":

TB count             643610/2684354
TB hash buckets      369534/524288
TB hash avg chain    1.729 max=8
TB flush count       0
TB invalidate count  4718

1.729 * 369534 = 638924.286, which is ~= 643610 - 4718 = 638892.

> I'm tempted to pull over gcc's non-chaining hash table implementation
> (libiberty/hashtab.c, still gplv2+) and compare...

You can try, but I think performance wouldn't be great, because the
comparison function would be called way too often due to the ht using
open addressing. The problem there is not only the comparisons
themselves, but all the cache lines needed to read the fields used in
the comparison. I haven't tested libiberty's htable, but I did test
the htable in concurrencykit[1], which also uses open addressing.

With ck's ht, performance was not good when booting ARM: IIRC ~30% of
runtime was spent on tb_cmp(). I also added the full hash to each TB so
that it would be compared first, but it didn't make a difference, since
the delay was due to loading the cache line (I saw this with perf(1)'s
annotated code, which showed that ~80% of the time spent in tb_cmp()
was in performing the first load of the TB's fields).

This led me to a design that had buckets with a small set of hash &
pointer pairs, all in the same cache line as the head (then I
discovered somebody else had thought of this, and that's why there's
a link to the CLHT paper in qht.c).

BTW I tested ck's htable also because of a requirement we have for
MTTCG, which is to support lock-free concurrent lookups. AFAICT
libiberty's ht doesn't support this, so it might be a bit faster
than ck's.

Thanks,

		Emilio

[1] http://concurrencykit.org/
    More info on their htable implementation here:
    http://backtrace.io/blog/blog/2015/03/13/workload-specialization/
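The bucket design described above, a small set of hash & pointer pairs packed
into the same cache line as the chain head so that the expensive comparison
function only runs (and the object's cache line is only touched) when the
stored hash already matches, can be sketched roughly as follows. This is an
illustrative, self-contained C sketch with made-up names, not qht's actual
code; qht's per-bucket seqlock and resizing are omitted.

#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE    64
#define ENTRIES_PER_BUCKET  4

/* One bucket: a few (hash, pointer) pairs plus an overflow link, sized
 * and aligned so that a head bucket fits in a single cache line. */
struct ht_bucket {
    uint32_t hashes[ENTRIES_PER_BUCKET];   /* full hashes, checked first  */
    void *pointers[ENTRIES_PER_BUCKET];    /* the objects, e.g. TBs       */
    struct ht_bucket *next;                /* overflow chain, rarely used */
} __attribute__((aligned(CACHE_LINE_SIZE)));

typedef bool (*ht_cmp_fn)(const void *obj, const void *userp);

/* Walk one chain; call the expensive comparison function, and thus pull
 * in the object's cache line, only when the stored hash matches. */
static void *ht_chain_lookup(const struct ht_bucket *head, uint32_t hash,
                             ht_cmp_fn cmp, const void *userp)
{
    const struct ht_bucket *b;
    int i;

    for (b = head; b != NULL; b = b->next) {
        for (i = 0; i < ENTRIES_PER_BUCKET; i++) {
            if (b->pointers[i] != NULL && b->hashes[i] == hash &&
                cmp(b->pointers[i], userp)) {
                return b->pointers[i];
            }
        }
    }
    return NULL;
}

With this layout, a chain of length 1 costs roughly one cache line to scan
regardless of how many of its slots are valid, which is also why the
per-bucket chain metric discussed above tracks lookup cost better than a
per-element one.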
On 04/22/2016 04:57 PM, Emilio G. Cota wrote:
> On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
>> FWIW, so that I could get an idea of how the stats change as we improve the
>> hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
>> attachment 2 attempting to fix the accounting for patches 9 and 10.
>
> For qht, I dislike the approach of reporting "avg chain" per-element,
> instead of per-bucket. Performance for a bucket whose entries are
> all valid is virtually the same as that of a bucket that only
> has one valid element; thus, with per-bucket reporting, we'd say that
> the chain length is 1 in both cases, i.e. "perfect". With per-element
> reporting, we'd report 4 (on a 64-bit host, since that's the value of
> QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
> wrong idea (users would think they're in trouble, when they're not).

But otherwise you have no way of knowing how full the buckets are. The
bucket size is just something that you have to keep in mind.

> If those numbers are off, then either this
>   assert(hinfo.used_entries ==
>          tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
> should trigger, or the accounting isn't right.

I think I used an NDEBUG build, so these weren't effective.

> Only the second report ("after 7/11") seems good (taking into account
> the limited precision of just 3 decimals):
>   5.26 * 32582 = 171381.32 ~= 171367
> which leads me to believe that you've used the TB and invalidate
> counts from that test.

The TB and invalidate numbers are repeatable; the same every time.

> You can try, but I think performance wouldn't be great, because the
> comparison function would be called way too often due to the ht using
> open addressing. The problem there is not only the comparisons
> themselves, but all the cache lines needed to read the fields used in
> the comparison. I haven't tested libiberty's htable, but I did test
> the htable in concurrencykit[1], which also uses open addressing.

You are right that having the full hash for primary comparison is a big
win, especially with how complex our comparison functions are. And
you're right that we have to have two of them.

> This led me to a design that had buckets with a small set of hash &
> pointer pairs, all in the same cache line as the head (then I
> discovered somebody else had thought of this, and that's why there's
> a link to the CLHT paper in qht.c).

Fair. It's a good design.


r~
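To make the two metrics being debated concrete, here is a small,
self-contained C sketch (hypothetical bucket layout and names, not qht's
statistics code) that computes both averages over the same table:

#include <stdio.h>
#include <stddef.h>

#define ENTRIES_PER_BUCKET 4   /* stand-in for QHT_BUCKET_ENTRIES on 64-bit */

struct bucket {
    void *entries[ENTRIES_PER_BUCKET];   /* NULL == empty slot */
    struct bucket *next;                 /* overflow bucket or NULL */
};

/* Count the valid entries in one chain and the buckets it spans. */
static void chain_stats(const struct bucket *head, size_t *entries,
                        size_t *buckets)
{
    const struct bucket *b;
    int i;

    *entries = 0;
    *buckets = 0;
    for (b = head; b != NULL; b = b->next) {
        (*buckets)++;
        for (i = 0; i < ENTRIES_PER_BUCKET; i++) {
            *entries += (b->entries[i] != NULL);
        }
    }
}

/*
 * Per-element metric: valid entries per used head bucket.  A single,
 * completely full head bucket reports 4 even though a lookup touches
 * just one cache line.
 * Per-bucket metric: buckets walked per used head bucket.  The same
 * full head bucket reports 1.
 */
static void report(const struct bucket *heads, size_t n)
{
    size_t used = 0, total_entries = 0, total_buckets = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        size_t e, b;

        chain_stats(&heads[i], &e, &b);
        if (e == 0) {
            continue;                    /* empty bucket: not counted */
        }
        used++;
        total_entries += e;
        total_buckets += b;
    }
    printf("avg chain (per element): %.3f\n",
           used ? (double)total_entries / used : 0.0);
    printf("avg chain (per bucket):  %.3f\n",
           used ? (double)total_buckets / used : 0.0);
}

int main(void)
{
    /* Two head buckets: one completely full, one empty. */
    static struct bucket heads[2];
    static int dummy[ENTRIES_PER_BUCKET];
    int i;

    for (i = 0; i < ENTRIES_PER_BUCKET; i++) {
        heads[0].entries[i] = &dummy[i];
    }
    report(heads, 2);   /* prints 4.000 (per element) and 1.000 (per bucket) */
    return 0;
}

For a table made of full but unchained head buckets, the per-element number
approaches ENTRIES_PER_BUCKET while the per-bucket number stays at 1.0; that
is the distinction drawn in the exchange above.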
On Sun, Apr 24, 2016 at 12:46:08 -0700, Richard Henderson wrote:
> On 04/22/2016 04:57 PM, Emilio G. Cota wrote:
> >On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> >>FWIW, so that I could get an idea of how the stats change as we improve the
> >>hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> >>attachment 2 attempting to fix the accounting for patches 9 and 10.
> >
> >For qht, I dislike the approach of reporting "avg chain" per-element,
> >instead of per-bucket. Performance for a bucket whose entries are
> >all valid is virtually the same as that of a bucket that only
> >has one valid element; thus, with per-bucket reporting, we'd say that
> >the chain length is 1 in both cases, i.e. "perfect". With per-element
> >reporting, we'd report 4 (on a 64-bit host, since that's the value of
> >QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
> >wrong idea (users would think they're in trouble, when they're not).
>
> But otherwise you have no way of knowing how full the buckets are. The
> bucket size is just something that you have to keep in mind.

I'll make some changes in v4 that I think will address both your and
my concerns:
- Report the number of empty buckets
- Do not count empty buckets when reporting avg bucket chain length
- Report average bucket occupancy (in %, so that QHT_BUCKET_ENTRIES
  does not have to be reported.)

> >If those numbers are off, then either this
> >  assert(hinfo.used_entries ==
> >         tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
> >should trigger, or the accounting isn't right.
>
> I think I used an NDEBUG build, so these weren't effective.
>
> >Only the second report ("after 7/11") seems good (taking into account
> >the limited precision of just 3 decimals):
> >  5.26 * 32582 = 171381.32 ~= 171367
> >which leads me to believe that you've used the TB and invalidate
> >counts from that test.
>
> The TB and invalidate numbers are repeatable; the same every time.

Then something else is going on, because both the 1st and 3rd tests
are way off. I'd re-test with assertions enabled.

Thanks,

		Emilio
On Sun, Apr 24, 2016 at 18:06:51 -0400, Emilio G. Cota wrote:
> On Sun, Apr 24, 2016 at 12:46:08 -0700, Richard Henderson wrote:
> > On 04/22/2016 04:57 PM, Emilio G. Cota wrote:
> > >On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> > >>FWIW, so that I could get an idea of how the stats change as we improve the
> > >>hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> > >>attachment 2 attempting to fix the accounting for patches 9 and 10.
> > >
> > >For qht, I dislike the approach of reporting "avg chain" per-element,
> > >instead of per-bucket. Performance for a bucket whose entries are
> > >all valid is virtually the same as that of a bucket that only
> > >has one valid element; thus, with per-bucket reporting, we'd say that
> > >the chain length is 1 in both cases, i.e. "perfect". With per-element
> > >reporting, we'd report 4 (on a 64-bit host, since that's the value of
> > >QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
> > >wrong idea (users would think they're in trouble, when they're not).
> >
> > But otherwise you have no way of knowing how full the buckets are. The
> > bucket size is just something that you have to keep in mind.
>
> I'll make some changes in v4 that I think will address both your and
> my concerns:
> - Report the number of empty buckets
> - Do not count empty buckets when reporting avg bucket chain length
> - Report average bucket occupancy (in %, so that QHT_BUCKET_ENTRIES
>   does not have to be reported.)

How does the following look?

Example with good hashing, i.e. func5(phys_pc, pc, flags):
TB count             704242/1342156
[...]
TB hash buckets      386484/524288 (73.72% used)
TB hash occupancy    32.57% avg chain occupancy. Histogram: 0-10%??????????90-100%
TB hash avg chain    1.02 buckets. Histogram: 1???3

Example with bad hashing, i.e. func5(phys_pc, 0, 0):
TB count             710748/1342156
[...]
TB hash buckets      113569/524288 (21.66% used)
TB hash occupancy    10.24% avg chain occupancy. Histogram: 0-10%??????????90-100%
TB hash avg chain    2.11 buckets. Histogram: 1??????????93

Note that:

- "TB hash avg chain" does _not_ count empty buckets. This gives
  an idea of how many buckets a typical hit goes through.

- "TB hash occupancy" _does_ count empty buckets. It is called
  "avg chain occupancy" and not "avg occupancy" because the
  counts are only valid per-chain, due to the seqlock protecting
  each chain.

Thanks,

		Emilio
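Roughly, the three numbers above could be derived as in the following sketch
(again a hypothetical layout with illustrative names, not the actual qht
statistics code; the histograms are omitted):

#include <stddef.h>

#define ENTRIES_PER_BUCKET 4   /* stand-in for QHT_BUCKET_ENTRIES */

struct bucket {
    void *entries[ENTRIES_PER_BUCKET];   /* NULL == empty slot */
    struct bucket *next;                 /* overflow bucket or NULL */
};

struct ht_stats {
    double used_ratio;   /* "TB hash buckets":   non-empty heads / all heads */
    double occupancy;    /* "TB hash occupancy": valid slots / slots, averaged
                            over every chain, empty chains included          */
    double avg_chain;    /* "TB hash avg chain": buckets per non-empty chain */
};

static struct ht_stats ht_get_stats(const struct bucket *heads, size_t n)
{
    struct ht_stats s = { 0.0, 0.0, 0.0 };
    size_t used = 0, chained_buckets = 0;
    double occ_sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        size_t entries = 0, buckets = 0;
        const struct bucket *b;
        int j;

        for (b = &heads[i]; b != NULL; b = b->next) {
            buckets++;
            for (j = 0; j < ENTRIES_PER_BUCKET; j++) {
                entries += (b->entries[j] != NULL);
            }
        }
        /* occupancy counts empty chains: an empty head contributes 0% */
        occ_sum += (double)entries / (double)(buckets * ENTRIES_PER_BUCKET);
        if (entries) {
            used++;
            chained_buckets += buckets;  /* avg chain skips empty chains */
        }
    }
    s.used_ratio = n ? (double)used / (double)n : 0.0;
    s.occupancy = n ? occ_sum / (double)n : 0.0;
    s.avg_chain = used ? (double)chained_buckets / (double)used : 0.0;
    return s;
}

Note how the occupancy average runs over every chain, empty or not, while the
chain-length average skips empty chains, matching the two notes above.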
On 04/26/2016 07:43 PM, Emilio G. Cota wrote:
> How does the following look?
>
> Example with good hashing, i.e. func5(phys_pc, pc, flags):
> TB count             704242/1342156
> [...]
> TB hash buckets      386484/524288 (73.72% used)
> TB hash occupancy    32.57% avg chain occupancy. Histogram: 0-10%??????????90-100%
> TB hash avg chain    1.02 buckets. Histogram: 1???3
>
> Example with bad hashing, i.e. func5(phys_pc, 0, 0):
> TB count             710748/1342156
> [...]
> TB hash buckets      113569/524288 (21.66% used)
> TB hash occupancy    10.24% avg chain occupancy. Histogram: 0-10%??????????90-100%
> TB hash avg chain    2.11 buckets. Histogram: 1??????????93
>
> Note that:
>
> - "TB hash avg chain" does _not_ count empty buckets. This gives
>   an idea of how many buckets a typical hit goes through.
>
> - "TB hash occupancy" _does_ count empty buckets. It is called
>   "avg chain occupancy" and not "avg occupancy" because the
>   counts are only valid per-chain, due to the seqlock protecting
>   each chain.

Looks really good.


r~
diff --git a/translate-all.c b/translate-all.c
index 1a8f68b..ed296d5 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1671,39 +1671,55 @@ void tb_flush_jmp_cache(CPUState *cpu, target_ulong addr)
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    int i, target_code_size, max_target_code_size;
-    int direct_jmp_count, direct_jmp2_count, cross_page;
+    size_t target_code_size, max_target_code_size;
+    unsigned direct_jmp_count, direct_jmp2_count, cross_page;
+    unsigned used_buckets, max_chain, hash_tbs;
     TranslationBlock *tb;
+    int i;
 
     target_code_size = 0;
     max_target_code_size = 0;
     cross_page = 0;
     direct_jmp_count = 0;
     direct_jmp2_count = 0;
-    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
-        tb = &tcg_ctx.tb_ctx.tbs[i];
-        target_code_size += tb->size;
-        if (tb->size > max_target_code_size) {
-            max_target_code_size = tb->size;
-        }
-        if (tb->page_addr[1] != -1) {
-            cross_page++;
-        }
-        if (tb->tb_next_offset[0] != 0xffff) {
-            direct_jmp_count++;
-            if (tb->tb_next_offset[1] != 0xffff) {
-                direct_jmp2_count++;
+    used_buckets = 0;
+    hash_tbs = 0;
+    max_chain = 0;
+
+    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
+        if (tcg_ctx.tb_ctx.tb_phys_hash[i]) {
+            unsigned this_chain = 0;
+            for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
+                 tb = tb->phys_hash_next) {
+                this_chain++;
+                hash_tbs++;
+                target_code_size += tb->size;
+                if (tb->page_addr[1] != -1) {
+                    cross_page++;
+                }
+                if (tb->tb_next_offset[0] != 0xffff) {
+                    direct_jmp_count++;
+                    if (tb->tb_next_offset[1] != 0xffff) {
+                        direct_jmp2_count++;
+                    }
+                }
             }
+            if (this_chain > max_chain) {
+                max_chain = this_chain;
+            }
+            used_buckets++;
         }
     }
-    /* XXX: avoid using doubles ? */
+    assert(hash_tbs ==
+           tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
+
     cpu_fprintf(f, "Translation buffer state:\n");
     cpu_fprintf(f, "gen code size %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
     cpu_fprintf(f, "TB count %d/%d\n",
                 tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.code_gen_max_blocks);
-    cpu_fprintf(f, "TB avg target size %d max=%d bytes\n",
+    cpu_fprintf(f, "TB avg target size %zd max=%zd bytes\n",
                 tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
                 tcg_ctx.tb_ctx.nb_tbs : 0,
                 max_target_code_size);
@@ -1717,13 +1733,18 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "cross page TB count %d (%d%%)\n",
                 cross_page, tcg_ctx.tb_ctx.nb_tbs ?
                 (cross_page * 100) / tcg_ctx.tb_ctx.nb_tbs : 0);
-    cpu_fprintf(f, "direct jump count %d (%d%%) (2 jumps=%d %d%%)\n",
+    cpu_fprintf(f, "direct jump count %u (%u%%) (2 jumps=%u %u%%)\n",
                 direct_jmp_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp_count * 100) /
                 tcg_ctx.tb_ctx.nb_tbs : 0,
                 direct_jmp2_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
                 tcg_ctx.tb_ctx.nb_tbs : 0);
+    cpu_fprintf(f, "TB hash buckets %u/%d\n",
+                used_buckets, CODE_GEN_PHYS_HASH_SIZE);
+    cpu_fprintf(f, "TB hash avg chain %0.3f max=%u\n",
+                used_buckets ? (double)hash_tbs / used_buckets : 0.0,
+                max_chain);
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count %d\n", tcg_ctx.tb_ctx.tb_flush_count);
     cpu_fprintf(f, "TB invalidate count %d\n",