Message ID: 20161025153507.27110-1-bobby.prani@gmail.com
State: New, archived
On 25/10/2016 17:35, Pranith Kumar wrote:
> Using perf, I see that the sequence lock is a bottleneck since it is
> being read by everyone. Giving it its own cache line seems to help
> things quite a bit.
>
> Using qht-bench, I measured the following for:
>
> $ ./tests/qht-bench -d 10 -n 24 -u <x>
>
> update %   throughput (base)   throughput (patch)   %change
>  0          8.07                13.33                +65%
> 10          7.10                 8.90                +25%
> 20          6.34                 7.02                +10%
> 30          5.48                 6.11                +9.6%
> 40          4.90                 5.46                +11.4%
>
> I am not able to see any significant increase for lower thread counts, though.

Whoa, that's a bit of a heavy hammer. The idea is that whoever modifies
the seqlock must take the spinlock first, and whoever reads the seqlock
will read one of the members of hashes[]/pointers[]. So having
everything in the same cache line should be better.

What happens if you just change QHT_BUCKET_ALIGN to 128?

Paolo

> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  include/qemu/seqlock.h | 2 +-
>  util/qht.c             | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index 8dee11d..954abe8 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -21,7 +21,7 @@ typedef struct QemuSeqLock QemuSeqLock;
>
>  struct QemuSeqLock {
>      unsigned sequence;
> -};
> +} QEMU_ALIGNED(64);
>
>  static inline void seqlock_init(QemuSeqLock *sl)
>  {
> diff --git a/util/qht.c b/util/qht.c
> index ff4d2e6..4d82609 100644
> --- a/util/qht.c
> +++ b/util/qht.c
> @@ -101,14 +101,14 @@
>   * be grabbed first.
>   */
>  struct qht_bucket {
> -    QemuSpin lock;
>      QemuSeqLock sequence;
> +    QemuSpin lock;
>      uint32_t hashes[QHT_BUCKET_ENTRIES];
>      void *pointers[QHT_BUCKET_ENTRIES];
>      struct qht_bucket *next;
>  } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
>
> -QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
> +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > 2 * QHT_BUCKET_ALIGN);
>
>  /**
>   * struct qht_map - structure to track an array of buckets
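For context, the read side Paolo describes looks roughly like the sketch below, written as if it lived in util/qht.c next to the struct definition. It uses QEMU's existing seqlock API (seqlock_read_begin/seqlock_read_retry from include/qemu/seqlock.h) and the qht_bucket fields from the patch, but the function itself is illustrative, not the actual qht lookup code.

#include "qemu/osdep.h"
#include "qemu/seqlock.h"
#include "qemu/atomic.h"

/* Sketch of a per-bucket lookup: sample the seqlock, scan hashes[] and
 * pointers[], and retry if a writer bumped the sequence in the meantime.
 * Because the reader touches the seqlock and hashes[] back to back,
 * keeping them in the same cache line avoids an extra miss per probe. */
static void *bucket_lookup_sketch(struct qht_bucket *b, uint32_t hash)
{
    void *ret;
    unsigned version;

    do {
        int i;

        ret = NULL;
        version = seqlock_read_begin(&b->sequence);
        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
            if (b->hashes[i] == hash) {
                ret = atomic_read(&b->pointers[i]);
                break;
            }
        }
    } while (seqlock_read_retry(&b->sequence, version));

    return ret;
}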
On Tue, Oct 25, 2016 at 11:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/10/2016 17:35, Pranith Kumar wrote:
>> Using perf, I see that the sequence lock is a bottleneck since it is
>> being read by everyone. Giving it its own cache line seems to help
>> things quite a bit.
>>
>> Using qht-bench, I measured the following for:
>>
>> $ ./tests/qht-bench -d 10 -n 24 -u <x>
>>
>> update %   throughput (base)   throughput (patch)   %change
>>  0          8.07                13.33                +65%
>> 10          7.10                 8.90                +25%
>> 20          6.34                 7.02                +10%
>> 30          5.48                 6.11                +9.6%
>> 40          4.90                 5.46                +11.4%
>>
>> I am not able to see any significant increase for lower thread counts, though.
>
> Whoa, that's a bit of a heavy hammer. The idea is that whoever modifies
> the seqlock must take the spinlock first, and whoever reads the seqlock
> will read one of the members of hashes[]/pointers[]. So having
> everything in the same cache line should be better.

But we are taking the seqlock of only the head bucket, while the
readers are reading hashes/pointers of the chained buckets.

>
> What happens if you just change QHT_BUCKET_ALIGN to 128?
>

update %   throughput
 0          4.47
10          9.82 (weird!)
20          8.13
30          7.13
40          6.45

Thanks,
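For reference, the experiment Paolo asked for amounts to a one-line change along these lines (a sketch; it assumes the constant is defined as 64 in util/qht.c, which is where the tree of that period kept it, and the hunk header is elided):

--- a/util/qht.c
+++ b/util/qht.c
@@ ... @@
-#define QHT_BUCKET_ALIGN 64
+#define QHT_BUCKET_ALIGN 128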
On 25/10/2016 17:49, Pranith Kumar wrote:
> But we are taking the seqlock of only the head bucket, while the
> readers are reading hashes/pointers of the chained buckets.

No, we aren't. See qht_lookup__slowpath.

This patch:

update %   throughput (base)   throughput (patch)   %change
 0          8.07                13.33                +65%
10          7.10                 8.90                +25%
20          6.34                 7.02                +10%
30          5.48                 6.11                +9.6%
40          4.90                 5.46                +11.4%

Just doubling the cache line size (QHT_BUCKET_ALIGN = 128):

update %   throughput (base)   throughput (128-byte)   %change
 0          8.07                 4.47                   -45%  ?!?
10          7.10                 9.82                   +38%
20          6.34                 8.13                   +28%
30          5.48                 7.13                   +30%
40          4.90                 6.45                   +32%

It seems to me that your machine has 128-byte cache lines.

Paolo
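As a side note on that last remark: one quick, Linux/glibc-specific way to check what line size the machine reports is the snippet below. _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and can report 0 on some systems, in which case /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size is an alternative.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Reported L1 data cache line size, in bytes. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    printf("L1 dcache line size: %ld bytes\n", line);
    return 0;
}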
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 8dee11d..954abe8 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -21,7 +21,7 @@ typedef struct QemuSeqLock QemuSeqLock;
 
 struct QemuSeqLock {
     unsigned sequence;
-};
+} QEMU_ALIGNED(64);
 
 static inline void seqlock_init(QemuSeqLock *sl)
 {
diff --git a/util/qht.c b/util/qht.c
index ff4d2e6..4d82609 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -101,14 +101,14 @@
  * be grabbed first.
  */
 struct qht_bucket {
-    QemuSpin lock;
     QemuSeqLock sequence;
+    QemuSpin lock;
     uint32_t hashes[QHT_BUCKET_ENTRIES];
     void *pointers[QHT_BUCKET_ENTRIES];
     struct qht_bucket *next;
 } QEMU_ALIGNED(QHT_BUCKET_ALIGN);
 
-QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
+QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > 2 * QHT_BUCKET_ALIGN);
 
 /**
  * struct qht_map - structure to track an array of buckets
Using perf, I see that the sequence lock is a bottleneck since it is
being read by everyone. Giving it its own cache line seems to help
things quite a bit.

Using qht-bench, I measured the following for:

$ ./tests/qht-bench -d 10 -n 24 -u <x>

update %   throughput (base)   throughput (patch)   %change
 0          8.07                13.33                +65%
10          7.10                 8.90                +25%
20          6.34                 7.02                +10%
30          5.48                 6.11                +9.6%
40          4.90                 5.46                +11.4%

I am not able to see any significant increase for lower thread counts, though.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 include/qemu/seqlock.h | 2 +-
 util/qht.c             | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
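To illustrate why the patch also relaxes the build assertion to 2 * QHT_BUCKET_ALIGN: once the seqlock type is 64-byte aligned, everything after it is pushed onto a second cache line. The standalone sketch below shows the effect; it mocks QemuSpin, hard-codes QHT_BUCKET_ENTRIES to 4 (the value util/qht.c used at the time, as far as I recall), and uses the raw GCC attribute in place of QEMU_ALIGNED, so it approximates the real layout rather than reproducing QEMU code.

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define QHT_BUCKET_ENTRIES 4            /* assumed value from util/qht.c */

/* Stand-ins for the QEMU types; the seqlock gets the patch's alignment. */
typedef struct { unsigned sequence; } __attribute__((aligned(64))) SeqLock;
typedef struct { int value; } Spin;

struct bucket {
    SeqLock sequence;                   /* padded out to a full 64 bytes */
    Spin lock;
    uint32_t hashes[QHT_BUCKET_ENTRIES];
    void *pointers[QHT_BUCKET_ENTRIES];
    struct bucket *next;
} __attribute__((aligned(64)));

int main(void)
{
    /* On a typical 64-bit build this prints sizeof = 128 and lock at
     * offset 64: the seqlock has the first cache line to itself, while
     * lock/hashes/pointers/next share the second one. */
    printf("sizeof(struct bucket) = %zu\n", sizeof(struct bucket));
    printf("offsetof(lock)        = %zu\n", offsetof(struct bucket, lock));
    printf("offsetof(hashes)      = %zu\n", offsetof(struct bucket, hashes));
    return 0;
}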