Message ID | 20240521233100.358002-1-mjguzik@gmail.com
---|---
State | New
Series | [v3] percpu_counter: add a cmpxchg-based _add_batch variant
Hi Mateusz,

On Wed, May 22, 2024 at 01:31:00AM +0200, Mateusz Guzik wrote:
> Interrupt disable/enable trips are quite expensive on x86-64 compared to
> a mere cmpxchg (note: no lock prefix!) and percpu counters are used
> quite often.
>
> With this change I get a bump of 1% ops/s for negative path lookups,
> plugged into will-it-scale:
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
> 	while (1) {
> 		int fd = open("/tmp/nonexistent", O_RDONLY);
> 		assert(fd == -1);
>
> 		(*iterations)++;
> 	}
> }
>
> The win would be higher if it was not for other slowdowns, but one has
> to start somewhere.

This is cool!

> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>
> v3:
> - add a missing word to the new comment
>
> v2:
> - dodge preemption
> - use this_cpu_try_cmpxchg
> - keep the old variant depending on CONFIG_HAVE_CMPXCHG_LOCAL
>
>  lib/percpu_counter.c | 44 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 39 insertions(+), 5 deletions(-)
>
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 44dd133594d4..c3140276bb36 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -73,17 +73,50 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
>  EXPORT_SYMBOL(percpu_counter_set);
>
>  /*
> - * local_irq_save() is needed to make the function irq safe:
> - * - The slow path would be ok as protected by an irq-safe spinlock.
> - * - this_cpu_add would be ok as it is irq-safe by definition.
> - * But:
> - * The decision slow path/fast path and the actual update must be atomic, too.
> + * Add to a counter while respecting batch size.
> + *
> + * There are 2 implementations, both dealing with the following problem:
> + *
> + * The decision slow path/fast path and the actual update must be atomic.
>  * Otherwise a call in process context could check the current values and
>  * decide that the fast path can be used. If now an interrupt occurs before
>  * the this_cpu_add(), and the interrupt updates this_cpu(*fbc->counters),
>  * then the this_cpu_add() that is executed after the interrupt has completed
>  * can produce values larger than "batch" or even overflows.
>  */
> +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> +/*
> + * Safety against interrupts is achieved in 2 ways:
> + * 1. the fast path uses local cmpxchg (note: no lock prefix)
> + * 2. the slow path operates with interrupts disabled
> + */
> +void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
> +{
> +	s64 count;
> +	unsigned long flags;
> +
> +	count = this_cpu_read(*fbc->counters);

Should this_cpu_read() be inside the do {} while in case the extreme
case that we get preempted after the read and before the cmpxchg AND
count + amount < batch on both the previous and next cpu?

> +	do {
> +		if (unlikely(abs(count + amount)) >= batch) {
> +			raw_spin_lock_irqsave(&fbc->lock, flags);
> +			/*
> +			 * Note: by now we might have migrated to another CPU
> +			 * or the value might have changed.
> +			 */
> +			count = __this_cpu_read(*fbc->counters);
> +			fbc->count += count + amount;
> +			__this_cpu_sub(*fbc->counters, count);
> +			raw_spin_unlock_irqrestore(&fbc->lock, flags);
> +			return;
> +		}
> +	} while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));
> +}
> +#else
> +/*
> + * local_irq_save() is used to make the function irq safe:
> + * - The slow path would be ok as protected by an irq-safe spinlock.
> + * - this_cpu_add would be ok as it is irq-safe by definition.
> + */
>  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
>  {
>  	s64 count;
> @@ -101,6 +134,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
>  	}
>  	local_irq_restore(flags);
>  }
> +#endif
>  EXPORT_SYMBOL(percpu_counter_add_batch);
>
>  /*
> --
> 2.39.2
>

Thanks,
Dennis
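The batching idea the patch builds on can be seen in isolation with a small user-space analogue: each thread accumulates a private delta and folds it into the shared total only once the delta reaches the batch size. This is an illustration of the concept rather than the kernel code; the counter_add_batch/worker names, the pthread/C11 plumbing and the BATCH value are made up here, and user space has no interrupts, which is exactly the complication the kernel variant has to handle with a local cmpxchg or irq disabling.

#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * User-space stand-in for struct percpu_counter: a shared total plus a
 * per-thread delta. The kernel uses per-CPU variables and a raw spinlock;
 * here a thread-local delta and an atomic total stand in for them.
 */
static _Atomic long long total;
static _Thread_local long long local_delta;

#define BATCH 32

static void counter_add_batch(long long amount)
{
	long long next = local_delta + amount;

	if (llabs(next) >= BATCH) {
		/* Slow path: fold the whole local delta into the total. */
		atomic_fetch_add(&total, next);
		local_delta = 0;
	} else {
		/* Fast path: stay thread-local, no shared-memory traffic. */
		local_delta = next;
	}
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++)
		counter_add_batch(1);
	/* Flush whatever is still pending locally. */
	atomic_fetch_add(&total, local_delta);
	local_delta = 0;
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("total = %lld\n", atomic_load(&total));	/* 400000 */
	return 0;
}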
On Wed, May 22, 2024 at 3:17 AM Dennis Zhou <dennis@kernel.org> wrote:
>
> Hi Mateusz,
>
> On Wed, May 22, 2024 at 01:31:00AM +0200, Mateusz Guzik wrote:
> > +	count = this_cpu_read(*fbc->counters);
>
> Should this_cpu_read() be inside the do {} while in case the extreme
> case that we get preempted after the read and before the cmpxchg AND
> count + amount < batch on both the previous and next cpu?
>

this_cpu_try_cmpxchg updates the local value on failure (hence &), so
from semantic pov this is equivalent to having this_cpu_read in the
loop. I'm using it the same way as mod_zone_state.
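The behaviour Mateusz describes, a failed cmpxchg refreshing the expected value, is the same contract C11's compare-exchange has: on failure the value actually found is written back into the "expected" argument, so the loop works on fresh data without an explicit re-read. A tiny user-space illustration with plain C11 atomics (nothing kernel-specific; the variable names are made up):

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
	_Atomic long counter = 40;
	long expected = 0;	/* deliberately stale */
	long amount = 2;

	/*
	 * Like the this_cpu_try_cmpxchg() loop in the patch, a failed
	 * compare-exchange stores the value it actually found into
	 * 'expected', so the next iteration recomputes the desired value
	 * from current data without a separate read.
	 */
	while (!atomic_compare_exchange_weak(&counter, &expected,
					     expected + amount))
		;	/* 'expected' now holds what 'counter' contained */

	printf("counter = %ld\n", atomic_load(&counter));	/* 42 */
	return 0;
}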
On Wed, May 22, 2024 at 06:59:02AM +0200, Mateusz Guzik wrote:
> On Wed, May 22, 2024 at 3:17 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > Should this_cpu_read() be inside the do {} while in case the extreme
> > case that we get preempted after the read and before the cmpxchg AND
> > count + amount < batch on both the previous and next cpu?
> >
>
> this_cpu_try_cmpxchg updates the local value on failure (hence &), so
> from semantic pov this is equivalent to having this_cpu_read in the
> loop. I'm using it the same way as mod_zone_state.
>

Ah I didn't catch that last night. Thanks.

I've applied this to percpu#for-6.11.

Thanks,
Dennis
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 44dd133594d4..c3140276bb36 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -73,17 +73,50 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 EXPORT_SYMBOL(percpu_counter_set);
 
 /*
- * local_irq_save() is needed to make the function irq safe:
- * - The slow path would be ok as protected by an irq-safe spinlock.
- * - this_cpu_add would be ok as it is irq-safe by definition.
- * But:
- * The decision slow path/fast path and the actual update must be atomic, too.
+ * Add to a counter while respecting batch size.
+ *
+ * There are 2 implementations, both dealing with the following problem:
+ *
+ * The decision slow path/fast path and the actual update must be atomic.
  * Otherwise a call in process context could check the current values and
  * decide that the fast path can be used. If now an interrupt occurs before
  * the this_cpu_add(), and the interrupt updates this_cpu(*fbc->counters),
  * then the this_cpu_add() that is executed after the interrupt has completed
  * can produce values larger than "batch" or even overflows.
  */
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * Safety against interrupts is achieved in 2 ways:
+ * 1. the fast path uses local cmpxchg (note: no lock prefix)
+ * 2. the slow path operates with interrupts disabled
+ */
+void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+	s64 count;
+	unsigned long flags;
+
+	count = this_cpu_read(*fbc->counters);
+	do {
+		if (unlikely(abs(count + amount)) >= batch) {
+			raw_spin_lock_irqsave(&fbc->lock, flags);
+			/*
+			 * Note: by now we might have migrated to another CPU
+			 * or the value might have changed.
+			 */
+			count = __this_cpu_read(*fbc->counters);
+			fbc->count += count + amount;
+			__this_cpu_sub(*fbc->counters, count);
+			raw_spin_unlock_irqrestore(&fbc->lock, flags);
+			return;
+		}
+	} while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));
+}
+#else
+/*
+ * local_irq_save() is used to make the function irq safe:
+ * - The slow path would be ok as protected by an irq-safe spinlock.
+ * - this_cpu_add would be ok as it is irq-safe by definition.
+ */
 void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
@@ -101,6 +134,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 	}
 	local_irq_restore(flags);
 }
+#endif
 EXPORT_SYMBOL(percpu_counter_add_batch);
 
 /*
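For context on how the function being modified is typically driven, here is a rough sketch of caller-side use of the percpu_counter API. It is not part of the patch; the module wrapper, the demo_ names and the batch value of 32 are illustrative only, and real users pick batch sizes suited to their workloads.

/* Sketch of caller-side usage; not from the patch. */
#include <linux/module.h>
#include <linux/percpu_counter.h>

static struct percpu_counter demo_counter;

static int __init demo_init(void)
{
	int ret;

	ret = percpu_counter_init(&demo_counter, 0, GFP_KERNEL);
	if (ret)
		return ret;

	/*
	 * Stays on the per-CPU fast path until the local delta reaches the
	 * batch (32 here), at which point it is folded into the shared
	 * count under fbc->lock, the slow path discussed in the thread.
	 */
	percpu_counter_add_batch(&demo_counter, 1, 32);

	pr_info("approx %lld, exact %lld\n",
		percpu_counter_read_positive(&demo_counter),
		percpu_counter_sum(&demo_counter));
	return 0;
}

static void __exit demo_exit(void)
{
	percpu_counter_destroy(&demo_counter);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");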