diff mbox series

[net-next,v2,2/2] ptr_ring: make __ptr_ring_empty() checking more reliable

Message ID 1624591136-6647-3-git-send-email-linyunsheng@huawei.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series: add benchmark selftest and optimization for ptr_ring

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 6267 this patch: 6267
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 43 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 6326 this patch: 6326
netdev/header_inline success Link

Commit Message

Yunsheng Lin June 25, 2021, 3:18 a.m. UTC
Currently r->queue[] is cleared after r->consumer_head is moved
forward, which makes the __ptr_ring_empty() checking called in
page_pool_refill_alloc_cache() unreliable if the checking is done
after the r->queue clearing and before the consumer_head moving
forward.

Move the r->queue[] clearing after consumer_head moving forward
to make __ptr_ring_empty() checking more reliable.

As a side effect of the above change, a consumer_head check is
avoided for the likely case, and it yields a noticeable performance
improvement when tested using the ptr_ring_test selftest
added in the previous patch.

Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
to test the case of single thread doing both the enqueuing and
dequeuing:

 arch     unpatched           patched       delta
arm64      4648 ms            4464 ms       +3.9%
 X86       2562 ms            2401 ms       +6.2%

Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
to test the case of one thread doing enqueuing and another thread
doing dequeuing concurrently, also known as single-producer/single-
consumer:

 arch      unpatched             patched         delta
arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
 x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V2: Add performance data.
---
 include/linux/ptr_ring.h | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

Comments

Michael S. Tsirkin June 25, 2021, 6:32 a.m. UTC | #1
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> Currently r->queue[] is cleared after r->consumer_head is moved
> forward, which makes the __ptr_ring_empty() checking called in
> page_pool_refill_alloc_cache() unreliable if the checking is done
> after the r->queue clearing and before the consumer_head moving
> forward.


Well, the documentation for __ptr_ring_empty clearly states it
is not guaranteed to be reliable.

 *
 * NB: This is only safe to call if ring is never resized.
 *
 * However, if some other CPU consumes ring entries at the same time, the value
 * returned is not guaranteed to be correct.
 *
 * In this case - to avoid incorrectly detecting the ring
 * as empty - the CPU consuming the ring entries is responsible
 * for either consuming all ring entries until the ring is empty,
 * or synchronizing with some other CPU and causing it to
 * re-test __ptr_ring_empty and/or consume the ring enteries
 * after the synchronization point.
 *

Is it then the case that page_pool_refill_alloc_cache violates
this requirement? How?

It looks like you are trying to make the guarantee stronger and ensure
no false positives.

If yes, please document this as such and update the comment, so all
code can be evaluated with an eye toward whether the new, stronger
guarantee is maintained. In particular I think I see at least one
issue with this immediately.


> Move the r->queue[] clearing after consumer_head moving forward
> to make __ptr_ring_empty() checking more reliable.
> 
> As a side effect of above change, a consumer_head checking is
> avoided for the likely case, and it has noticeable performance
> improvement when it is tested using the ptr_ring_test selftest
> added in the previous patch.
> 
> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> to test the case of single thread doing both the enqueuing and
> dequeuing:
> 
>  arch     unpatched           patched       delta
> arm64      4648 ms            4464 ms       +3.9%
>  X86       2562 ms            2401 ms       +6.2%
> 
> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> to test the case of one thread doing enqueuing and another thread
> doing dequeuing concurrently, also known as single-producer/single-
> consumer:
> 
>  arch      unpatched             patched         delta
> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> V2: Add performance data.
> ---
>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 808f9d3..db9c282 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>  	 * to work correctly.
>  	 */
> -	int consumer_head = r->consumer_head;
> -	int head = consumer_head++;
> +	int consumer_head = r->consumer_head + 1;
>  
>  	/* Once we have processed enough entries invalidate them in
>  	 * the ring all at once so producer can reuse their space in the ring.
> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	 */
>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>  		     consumer_head >= r->size)) {
> +		int tail = r->consumer_tail;
> +
> +		if (unlikely(consumer_head >= r->size)) {
> +			r->consumer_tail = 0;
> +			WRITE_ONCE(r->consumer_head, 0);
> +		} else {
> +			r->consumer_tail = consumer_head;
> +			WRITE_ONCE(r->consumer_head, consumer_head);
> +		}
> +
>  		/* Zero out entries in the reverse order: this way we touch the
>  		 * cache line that producer might currently be reading the last;
>  		 * producer won't make progress and touch other cache lines
>  		 * besides the first one until we write out all entries.
>  		 */
> -		while (likely(head >= r->consumer_tail))
> -			r->queue[head--] = NULL;
> -		r->consumer_tail = consumer_head;
> -	}
> -	if (unlikely(consumer_head >= r->size)) {
> -		consumer_head = 0;
> -		r->consumer_tail = 0;
> +		while (likely(--consumer_head >= tail))
> +			r->queue[consumer_head] = NULL;
> +
> +		return;


So if now we need this to be reliable, then
we also need an smp_wmb before writing r->queue[consumer_head];
there could be other gotchas.

>  	}
> +
>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>  	WRITE_ONCE(r->consumer_head, consumer_head);
>  }
> -- 
> 2.7.4
Michael S. Tsirkin June 25, 2021, 6:39 a.m. UTC | #2
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> Currently r->queue[] is cleared after r->consumer_head is moved
> forward, which makes the __ptr_ring_empty() checking called in
> page_pool_refill_alloc_cache() unreliable if the checking is done
> after the r->queue clearing and before the consumer_head moving
> forward.
> 
> Move the r->queue[] clearing after consumer_head moving forward
> to make __ptr_ring_empty() checking more reliable.
> 
> As a side effect of above change, a consumer_head checking is
> avoided for the likely case, and it has noticeable performance
> improvement when it is tested using the ptr_ring_test selftest
> added in the previous patch.
> 
> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> to test the case of single thread doing both the enqueuing and
> dequeuing:
> 
>  arch     unpatched           patched       delta
> arm64      4648 ms            4464 ms       +3.9%
>  X86       2562 ms            2401 ms       +6.2%
> 
> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> to test the case of one thread doing enqueuing and another thread
> doing dequeuing concurrently, also known as single-producer/single-
> consumer:
> 
>  arch      unpatched             patched         delta
> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%

Nice but it's small - could be a fluke.
How many tests did you run? What is the variance?
Did you try pinning to different CPUs to observe numa effects?
Please use perf or some other modern tool for this kind
of benchmark. Thanks!

> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> V2: Add performance data.
> ---
>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 808f9d3..db9c282 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>  	 * to work correctly.
>  	 */
> -	int consumer_head = r->consumer_head;
> -	int head = consumer_head++;
> +	int consumer_head = r->consumer_head + 1;
>  
>  	/* Once we have processed enough entries invalidate them in
>  	 * the ring all at once so producer can reuse their space in the ring.
> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	 */
>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>  		     consumer_head >= r->size)) {
> +		int tail = r->consumer_tail;
> +
> +		if (unlikely(consumer_head >= r->size)) {
> +			r->consumer_tail = 0;
> +			WRITE_ONCE(r->consumer_head, 0);
> +		} else {
> +			r->consumer_tail = consumer_head;
> +			WRITE_ONCE(r->consumer_head, consumer_head);
> +		}
> +
>  		/* Zero out entries in the reverse order: this way we touch the
>  		 * cache line that producer might currently be reading the last;
>  		 * producer won't make progress and touch other cache lines
>  		 * besides the first one until we write out all entries.
>  		 */
> -		while (likely(head >= r->consumer_tail))
> -			r->queue[head--] = NULL;
> -		r->consumer_tail = consumer_head;
> -	}
> -	if (unlikely(consumer_head >= r->size)) {
> -		consumer_head = 0;
> -		r->consumer_tail = 0;
> +		while (likely(--consumer_head >= tail))
> +			r->queue[consumer_head] = NULL;
> +
> +		return;
>  	}
> +
>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>  	WRITE_ONCE(r->consumer_head, consumer_head);
>  }
> -- 
> 2.7.4
Yunsheng Lin June 25, 2021, 7:21 a.m. UTC | #3
On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>> Currently r->queue[] is cleared after r->consumer_head is moved
>> forward, which makes the __ptr_ring_empty() checking called in
>> page_pool_refill_alloc_cache() unreliable if the checking is done
>> after the r->queue clearing and before the consumer_head moving
>> forward.
> 
> 
> Well the documentation for __ptr_ring_empty clearly states is
> is not guaranteed to be reliable.

Yes, this patch does not make __ptr_ring_empty() strictly reliable
without taking the r->consumer_lock, as per the discussion in [1].

1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011

> 
>  *
>  * NB: This is only safe to call if ring is never resized.
>  *
>  * However, if some other CPU consumes ring entries at the same time, the value
>  * returned is not guaranteed to be correct.
>  *
>  * In this case - to avoid incorrectly detecting the ring
>  * as empty - the CPU consuming the ring entries is responsible
>  * for either consuming all ring entries until the ring is empty,
>  * or synchronizing with some other CPU and causing it to
>  * re-test __ptr_ring_empty and/or consume the ring enteries
>  * after the synchronization point.
>  *
> 
> Is it then the case that page_pool_refill_alloc_cache violates
> this requirement? How?

As I understand it:
page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
taking r->consumer_lock; when the above data race happens, it will
bail out and allocate a page from the page allocator instead of
reusing a page in the ptr_ring, which *may* not happen if
__ptr_ring_empty() is more reliable.

> 
> It looks like you are trying to make the guarantee stronger and ensure
> no false positives.
> 
> If yes please document this as such, update the comment so all
> code can be evaluated with the eye towards whether the new stronger
> guarantee is maintained. In particular I think I see at least one
> issue with this immediately.
> 
> 
>> Move the r->queue[] clearing after consumer_head moving forward
>> to make __ptr_ring_empty() checking more reliable.
>>
>> As a side effect of above change, a consumer_head checking is
>> avoided for the likely case, and it has noticeable performance
>> improvement when it is tested using the ptr_ring_test selftest
>> added in the previous patch.
>>
>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>> to test the case of single thread doing both the enqueuing and
>> dequeuing:
>>
>>  arch     unpatched           patched       delta
>> arm64      4648 ms            4464 ms       +3.9%
>>  X86       2562 ms            2401 ms       +6.2%
>>
>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>> to test the case of one thread doing enqueuing and another thread
>> doing dequeuing concurrently, also known as single-producer/single-
>> consumer:
>>
>>  arch      unpatched             patched         delta
>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> V2: Add performance data.
>> ---
>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index 808f9d3..db9c282 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>>  	 * to work correctly.
>>  	 */
>> -	int consumer_head = r->consumer_head;
>> -	int head = consumer_head++;
>> +	int consumer_head = r->consumer_head + 1;
>>  
>>  	/* Once we have processed enough entries invalidate them in
>>  	 * the ring all at once so producer can reuse their space in the ring.
>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>  	 */
>>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>>  		     consumer_head >= r->size)) {
>> +		int tail = r->consumer_tail;
>> +
>> +		if (unlikely(consumer_head >= r->size)) {
>> +			r->consumer_tail = 0;
>> +			WRITE_ONCE(r->consumer_head, 0);
>> +		} else {
>> +			r->consumer_tail = consumer_head;
>> +			WRITE_ONCE(r->consumer_head, consumer_head);
>> +		}
>> +
>>  		/* Zero out entries in the reverse order: this way we touch the
>>  		 * cache line that producer might currently be reading the last;
>>  		 * producer won't make progress and touch other cache lines
>>  		 * besides the first one until we write out all entries.
>>  		 */
>> -		while (likely(head >= r->consumer_tail))
>> -			r->queue[head--] = NULL;
>> -		r->consumer_tail = consumer_head;
>> -	}
>> -	if (unlikely(consumer_head >= r->size)) {
>> -		consumer_head = 0;
>> -		r->consumer_tail = 0;
>> +		while (likely(--consumer_head >= tail))
>> +			r->queue[consumer_head] = NULL;
>> +
>> +		return;
> 
> 
> So if now we need this to be reliable then
> we also need smp_wmb before writing r->queue[consumer_head],
> there could be other gotchas.

Yes, this patch does not make it strictly reliable.
I think I could mention that in the commit log?

> 
>>  	}
>> +
>>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>>  	WRITE_ONCE(r->consumer_head, consumer_head);
>>  }
>> -- 
>> 2.7.4
> 
> 
> .
>
Michael S. Tsirkin June 25, 2021, 7:30 a.m. UTC | #4
On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >> Currently r->queue[] is cleared after r->consumer_head is moved
> >> forward, which makes the __ptr_ring_empty() checking called in
> >> page_pool_refill_alloc_cache() unreliable if the checking is done
> >> after the r->queue clearing and before the consumer_head moving
> >> forward.
> > 
> > 
> > Well the documentation for __ptr_ring_empty clearly states is
> > is not guaranteed to be reliable.
> 
> Yes, this patch does not make __ptr_ring_empty() strictly reliable
> without taking the r->consumer_lock, as the disscuission in [1].
> 
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
> 
> > 
> >  *
> >  * NB: This is only safe to call if ring is never resized.
> >  *
> >  * However, if some other CPU consumes ring entries at the same time, the value
> >  * returned is not guaranteed to be correct.
> >  *
> >  * In this case - to avoid incorrectly detecting the ring
> >  * as empty - the CPU consuming the ring entries is responsible
> >  * for either consuming all ring entries until the ring is empty,
> >  * or synchronizing with some other CPU and causing it to
> >  * re-test __ptr_ring_empty and/or consume the ring enteries
> >  * after the synchronization point.
> >  *
> > 
> > Is it then the case that page_pool_refill_alloc_cache violates
> > this requirement? How?
> 
> As my understanding:
> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
> taking r->consumer_lock, when the above data race happens, it will
> exit out and allocate page from the page allocator instead of reusing
> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
> is more reliable.

The question is: how do we know it's more reliable?
It would be nice if we actually made it more reliable;
as it is, we are just shifting races around.


> > 
> > It looks like you are trying to make the guarantee stronger and ensure
> > no false positives.
> > 
> > If yes please document this as such, update the comment so all
> > code can be evaluated with the eye towards whether the new stronger
> > guarantee is maintained. In particular I think I see at least one
> > issue with this immediately.
> > 
> > 
> >> Move the r->queue[] clearing after consumer_head moving forward
> >> to make __ptr_ring_empty() checking more reliable.
> >>
> >> As a side effect of above change, a consumer_head checking is
> >> avoided for the likely case, and it has noticeable performance
> >> improvement when it is tested using the ptr_ring_test selftest
> >> added in the previous patch.
> >>
> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >> to test the case of single thread doing both the enqueuing and
> >> dequeuing:
> >>
> >>  arch     unpatched           patched       delta
> >> arm64      4648 ms            4464 ms       +3.9%
> >>  X86       2562 ms            2401 ms       +6.2%
> >>
> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >> to test the case of one thread doing enqueuing and another thread
> >> doing dequeuing concurrently, also known as single-producer/single-
> >> consumer:
> >>
> >>  arch      unpatched             patched         delta
> >> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
> >>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >> V2: Add performance data.
> >> ---
> >>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
> >>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >> index 808f9d3..db9c282 100644
> >> --- a/include/linux/ptr_ring.h
> >> +++ b/include/linux/ptr_ring.h
> >> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> >>  	 * to work correctly.
> >>  	 */
> >> -	int consumer_head = r->consumer_head;
> >> -	int head = consumer_head++;
> >> +	int consumer_head = r->consumer_head + 1;
> >>  
> >>  	/* Once we have processed enough entries invalidate them in
> >>  	 * the ring all at once so producer can reuse their space in the ring.
> >> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>  	 */
> >>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> >>  		     consumer_head >= r->size)) {
> >> +		int tail = r->consumer_tail;
> >> +
> >> +		if (unlikely(consumer_head >= r->size)) {
> >> +			r->consumer_tail = 0;
> >> +			WRITE_ONCE(r->consumer_head, 0);
> >> +		} else {
> >> +			r->consumer_tail = consumer_head;
> >> +			WRITE_ONCE(r->consumer_head, consumer_head);
> >> +		}
> >> +
> >>  		/* Zero out entries in the reverse order: this way we touch the
> >>  		 * cache line that producer might currently be reading the last;
> >>  		 * producer won't make progress and touch other cache lines
> >>  		 * besides the first one until we write out all entries.
> >>  		 */
> >> -		while (likely(head >= r->consumer_tail))
> >> -			r->queue[head--] = NULL;
> >> -		r->consumer_tail = consumer_head;
> >> -	}
> >> -	if (unlikely(consumer_head >= r->size)) {
> >> -		consumer_head = 0;
> >> -		r->consumer_tail = 0;
> >> +		while (likely(--consumer_head >= tail))
> >> +			r->queue[consumer_head] = NULL;
> >> +
> >> +		return;
> > 
> > 
> > So if now we need this to be reliable then
> > we also need smp_wmb before writing r->queue[consumer_head],
> > there could be other gotchas.
> 
> Yes, This patch does not make it strictly reliable.
> T think I could mention that in the commit log?

OK so it's not that it makes it more reliable - this patch simply makes
a possible false positive less likely while making a false negative
more likely. Our assumption is that a false negative is cheaper, then?

How do we know that it is?

And even if we prove the ptr_ring itself is faster now,
how do we know which affects callers in a better way: a
false positive or a false negative?

I would rather we worked on actually making it reliable;
e.g. if we can guarantee no false positives, that would be
a net win.

> 
> > 
> >>  	}
> >> +
> >>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> >>  	WRITE_ONCE(r->consumer_head, consumer_head);
> >>  }
> >> -- 
> >> 2.7.4
> > 
> > 
> > .
> >
Yunsheng Lin June 25, 2021, 8:33 a.m. UTC | #5
On 2021/6/25 15:30, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
>> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
>>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>>>> Currently r->queue[] is cleared after r->consumer_head is moved
>>>> forward, which makes the __ptr_ring_empty() checking called in
>>>> page_pool_refill_alloc_cache() unreliable if the checking is done
>>>> after the r->queue clearing and before the consumer_head moving
>>>> forward.
>>>
>>>
>>> Well the documentation for __ptr_ring_empty clearly states is
>>> is not guaranteed to be reliable.
>>
>> Yes, this patch does not make __ptr_ring_empty() strictly reliable
>> without taking the r->consumer_lock, as the disscuission in [1].
>>
>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
>>
>>>
>>>  *
>>>  * NB: This is only safe to call if ring is never resized.
>>>  *
>>>  * However, if some other CPU consumes ring entries at the same time, the value
>>>  * returned is not guaranteed to be correct.
>>>  *
>>>  * In this case - to avoid incorrectly detecting the ring
>>>  * as empty - the CPU consuming the ring entries is responsible
>>>  * for either consuming all ring entries until the ring is empty,
>>>  * or synchronizing with some other CPU and causing it to
>>>  * re-test __ptr_ring_empty and/or consume the ring enteries
>>>  * after the synchronization point.
>>>  *
>>>
>>> Is it then the case that page_pool_refill_alloc_cache violates
>>> this requirement? How?
>>
>> As my understanding:
>> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
>> taking r->consumer_lock, when the above data race happens, it will
>> exit out and allocate page from the page allocator instead of reusing
>> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
>> is more reliable.
> 
> Question is how do we know it's more reliable?
> It would be nice if we did actually made it more reliable,
> as it is we are just shifting races around.
> 
> 
>>>
>>> It looks like you are trying to make the guarantee stronger and ensure
>>> no false positives.
>>>
>>> If yes please document this as such, update the comment so all
>>> code can be evaluated with the eye towards whether the new stronger
>>> guarantee is maintained. In particular I think I see at least one
>>> issue with this immediately.
>>>
>>>
>>>> Move the r->queue[] clearing after consumer_head moving forward
>>>> to make __ptr_ring_empty() checking more reliable.
>>>>
>>>> As a side effect of above change, a consumer_head checking is
>>>> avoided for the likely case, and it has noticeable performance
>>>> improvement when it is tested using the ptr_ring_test selftest
>>>> added in the previous patch.
>>>>
>>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>>>> to test the case of single thread doing both the enqueuing and
>>>> dequeuing:
>>>>
>>>>  arch     unpatched           patched       delta
>>>> arm64      4648 ms            4464 ms       +3.9%
>>>>  X86       2562 ms            2401 ms       +6.2%
>>>>
>>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>>>> to test the case of one thread doing enqueuing and another thread
>>>> doing dequeuing concurrently, also known as single-producer/single-
>>>> consumer:
>>>>
>>>>  arch      unpatched             patched         delta
>>>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> ---
>>>> V2: Add performance data.
>>>> ---
>>>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>>>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>>>> index 808f9d3..db9c282 100644
>>>> --- a/include/linux/ptr_ring.h
>>>> +++ b/include/linux/ptr_ring.h
>>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>>>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>>>>  	 * to work correctly.
>>>>  	 */
>>>> -	int consumer_head = r->consumer_head;
>>>> -	int head = consumer_head++;
>>>> +	int consumer_head = r->consumer_head + 1;
>>>>  
>>>>  	/* Once we have processed enough entries invalidate them in
>>>>  	 * the ring all at once so producer can reuse their space in the ring.
>>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>>>  	 */
>>>>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>>>>  		     consumer_head >= r->size)) {
>>>> +		int tail = r->consumer_tail;
>>>> +
>>>> +		if (unlikely(consumer_head >= r->size)) {
>>>> +			r->consumer_tail = 0;
>>>> +			WRITE_ONCE(r->consumer_head, 0);
>>>> +		} else {
>>>> +			r->consumer_tail = consumer_head;
>>>> +			WRITE_ONCE(r->consumer_head, consumer_head);
>>>> +		}
>>>> +
>>>>  		/* Zero out entries in the reverse order: this way we touch the
>>>>  		 * cache line that producer might currently be reading the last;
>>>>  		 * producer won't make progress and touch other cache lines
>>>>  		 * besides the first one until we write out all entries.
>>>>  		 */
>>>> -		while (likely(head >= r->consumer_tail))
>>>> -			r->queue[head--] = NULL;
>>>> -		r->consumer_tail = consumer_head;
>>>> -	}
>>>> -	if (unlikely(consumer_head >= r->size)) {
>>>> -		consumer_head = 0;
>>>> -		r->consumer_tail = 0;
>>>> +		while (likely(--consumer_head >= tail))
>>>> +			r->queue[consumer_head] = NULL;
>>>> +
>>>> +		return;
>>>
>>>
>>> So if now we need this to be reliable then
>>> we also need smp_wmb before writing r->queue[consumer_head],
>>> there could be other gotchas.
>>
>> Yes, This patch does not make it strictly reliable.
>> T think I could mention that in the commit log?
> 
> OK so it's not that it makes it more reliable - this patch simply makes
> a possible false positive less likely while making  a false negative
> more likely. Our assumption is that a false negative is cheaper then?
> 
> How do we know that it is?
> 
> And even if we prove the ptr_ring itself is faster now,
> how do we know what affects callers in a better way a
> false positive or a false negative?
> 
> I would rather we worked on actually making it reliable
> e.g. if we can guarantee no false positives, that would be
> a net win.
I thought more deeply about the case you mentioned above; it
seems that for it to happen, consumer_head needs to be rolled
back to zero and incremented to the point where the caller of
__ptr_ring_empty() is still *not* able to see the r->queue[]
entries which have been set to NULL in __ptr_ring_discard_one().

It seems the smp_wmb() only needs to be done once, when consumer_head
is rolled back to zero, and maybe that is enough to make sure the
case you mentioned is fixed too?

And as the smp_wmb() is only done once in a round of producing/
consuming, the performance impact should be minimized? (Of
course we need to test it too.)

> 
>>
>>>
>>>>  	}
>>>> +
>>>>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>>>>  	WRITE_ONCE(r->consumer_head, consumer_head);
>>>>  }
>>>> -- 
>>>> 2.7.4
>>>
>>>
>>> .
>>>
> 
> 
> .
>
Yunsheng Lin June 25, 2021, 9:20 a.m. UTC | #6
On 2021/6/25 14:39, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>> Currently r->queue[] is cleared after r->consumer_head is moved
>> forward, which makes the __ptr_ring_empty() checking called in
>> page_pool_refill_alloc_cache() unreliable if the checking is done
>> after the r->queue clearing and before the consumer_head moving
>> forward.
>>
>> Move the r->queue[] clearing after consumer_head moving forward
>> to make __ptr_ring_empty() checking more reliable.
>>
>> As a side effect of above change, a consumer_head checking is
>> avoided for the likely case, and it has noticeable performance
>> improvement when it is tested using the ptr_ring_test selftest
>> added in the previous patch.
>>
>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>> to test the case of single thread doing both the enqueuing and
>> dequeuing:
>>
>>  arch     unpatched           patched       delta
>> arm64      4648 ms            4464 ms       +3.9%
>>  X86       2562 ms            2401 ms       +6.2%
>>
>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>> to test the case of one thread doing enqueuing and another thread
>> doing dequeuing concurrently, also known as single-producer/single-
>> consumer:
>>
>>  arch      unpatched             patched         delta
>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> 
> Nice but it's small - could be a fluke.
> How many tests did you run? What is the variance?
> Did you try pinning to different CPUs to observe numa effects?
> Please use perf or some other modern tool for this kind
> of benchmark. Thanks!

The result is quite stable; retesting with perf stat:

---------------unpatched ptr_ring.c begin----------------------------------

perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2385.49 msec task-clock                #    1.000 CPUs utilized
                26      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.024 K/sec
        6202023521      cycles                    #    2.600 GHz
       17424187640      instructions              #    2.81  insn per cycle
   <not supported>      branches
           6506477      branch-misses

       2.385785170 seconds time elapsed

       2.384014000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2383.67 msec task-clock                #    1.000 CPUs utilized
                26      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.024 K/sec
        6197278066      cycles                    #    2.600 GHz
       17424207772      instructions              #    2.81  insn per cycle
   <not supported>      branches
           6495766      branch-misses

       2.383941170 seconds time elapsed

       2.382215000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2391.16 msec task-clock                #    1.000 CPUs utilized
                25      context-switches          #    0.010 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.024 K/sec
        6216704120      cycles                    #    2.600 GHz
       17424243041      instructions              #    2.80  insn per cycle
   <not supported>      branches
           6483886      branch-misses

       2.391420440 seconds time elapsed

       2.389647000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2390.10 msec task-clock                #    1.000 CPUs utilized
                26      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                58      page-faults               #    0.024 K/sec
        6213995715      cycles                    #    2.600 GHz
       17424227499      instructions              #    2.80  insn per cycle
   <not supported>      branches
           6474069      branch-misses

       2.390367070 seconds time elapsed

       2.388644000 seconds user
       0.000000000 seconds sys

---------------unpatched ptr_ring.c end----------------------------------



---------------patched ptr_ring.c begin----------------------------------
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2199.18 msec task-clock                #    1.000 CPUs utilized
                23      context-switches          #    0.010 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                56      page-faults               #    0.025 K/sec
        5717671859      cycles                    #    2.600 GHz
       16124164124      instructions              #    2.82  insn per cycle
   <not supported>      branches
           6564829      branch-misses

       2.199445990 seconds time elapsed

       2.197859000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2222.63 msec task-clock                #    1.000 CPUs utilized
                23      context-switches          #    0.010 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.026 K/sec
        5778632853      cycles                    #    2.600 GHz
       16124210769      instructions              #    2.79  insn per cycle
   <not supported>      branches
           6603904      branch-misses

       2.222901020 seconds time elapsed

       2.221312000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2252.28 msec task-clock                #    1.000 CPUs utilized
                25      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.025 K/sec
        5855668335      cycles                    #    2.600 GHz
       16124310588      instructions              #    2.75  insn per cycle
   <not supported>      branches
           6777279      branch-misses

       2.252543340 seconds time elapsed

       2.250897000 seconds user
       0.000000000 seconds sys


root@(none):~#
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2209.70 msec task-clock                #    1.000 CPUs utilized
                24      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                58      page-faults               #    0.026 K/sec
        5745003772      cycles                    #    2.600 GHz
       16124198886      instructions              #    2.81  insn per cycle
   <not supported>      branches
           6508414      branch-misses

       2.209973960 seconds time elapsed

       2.208354000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2211.70 msec task-clock                #    1.000 CPUs utilized
                24      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.026 K/sec
        5750136694      cycles                    #    2.600 GHz
       16124176577      instructions              #    2.80  insn per cycle
   <not supported>      branches
           6553023      branch-misses

       2.211968470 seconds time elapsed

       2.210303000 seconds user
       0.000000000 seconds sys

---------------patched ptr_ring.c end----------------------------------

> 
>>
Michael S. Tsirkin June 27, 2021, 6:03 a.m. UTC | #7
On Fri, Jun 25, 2021 at 04:33:40PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 15:30, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
> >> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> >>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >>>> Currently r->queue[] is cleared after r->consumer_head is moved
> >>>> forward, which makes the __ptr_ring_empty() checking called in
> >>>> page_pool_refill_alloc_cache() unreliable if the checking is done
> >>>> after the r->queue clearing and before the consumer_head moving
> >>>> forward.
> >>>
> >>>
> >>> Well the documentation for __ptr_ring_empty clearly states is
> >>> is not guaranteed to be reliable.
> >>
> >> Yes, this patch does not make __ptr_ring_empty() strictly reliable
>> without taking the r->consumer_lock, as discussed in [1].
> >>
> >> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
> >>
> >>>
> >>>  *
> >>>  * NB: This is only safe to call if ring is never resized.
> >>>  *
> >>>  * However, if some other CPU consumes ring entries at the same time, the value
> >>>  * returned is not guaranteed to be correct.
> >>>  *
> >>>  * In this case - to avoid incorrectly detecting the ring
> >>>  * as empty - the CPU consuming the ring entries is responsible
> >>>  * for either consuming all ring entries until the ring is empty,
> >>>  * or synchronizing with some other CPU and causing it to
> >>>  * re-test __ptr_ring_empty and/or consume the ring enteries
> >>>  * after the synchronization point.
> >>>  *
> >>>
> >>> Is it then the case that page_pool_refill_alloc_cache violates
> >>> this requirement? How?
> >>
> >> As my understanding:
> >> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
> >> taking r->consumer_lock, when the above data race happens, it will
> >> exit out and allocate page from the page allocator instead of reusing
> >> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
> >> is more reliable.
> > 
> > Question is how do we know it's more reliable?
> > It would be nice if we did actually made it more reliable,
> > as it is we are just shifting races around.
> > 
> > 
> >>>
> >>> It looks like you are trying to make the guarantee stronger and ensure
> >>> no false positives.
> >>>
> >>> If yes please document this as such, update the comment so all
> >>> code can be evaluated with the eye towards whether the new stronger
> >>> guarantee is maintained. In particular I think I see at least one
> >>> issue with this immediately.
> >>>
> >>>
> >>>> Move the r->queue[] clearing after consumer_head moving forward
> >>>> to make __ptr_ring_empty() checking more reliable.
> >>>>
> >>>> As a side effect of above change, a consumer_head checking is
> >>>> avoided for the likely case, and it has noticeable performance
> >>>> improvement when it is tested using the ptr_ring_test selftest
> >>>> added in the previous patch.
> >>>>
> >>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >>>> to test the case of single thread doing both the enqueuing and
> >>>> dequeuing:
> >>>>
> >>>>  arch     unpatched           patched       delta
> >>>> arm64      4648 ms            4464 ms       +3.9%
> >>>>  X86       2562 ms            2401 ms       +6.2%
> >>>>
> >>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >>>> to test the case of one thread doing enqueuing and another thread
> >>>> doing dequeuing concurrently, also known as single-producer/single-
> >>>> consumer:
> >>>>
> >>>>  arch      unpatched             patched         delta
> >>>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
> >>>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> >>>>
> >>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >>>> ---
> >>>> V2: Add performance data.
> >>>> ---
> >>>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
> >>>>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >>>> index 808f9d3..db9c282 100644
> >>>> --- a/include/linux/ptr_ring.h
> >>>> +++ b/include/linux/ptr_ring.h
> >>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>>>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> >>>>  	 * to work correctly.
> >>>>  	 */
> >>>> -	int consumer_head = r->consumer_head;
> >>>> -	int head = consumer_head++;
> >>>> +	int consumer_head = r->consumer_head + 1;
> >>>>  
> >>>>  	/* Once we have processed enough entries invalidate them in
> >>>>  	 * the ring all at once so producer can reuse their space in the ring.
> >>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>>>  	 */
> >>>>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> >>>>  		     consumer_head >= r->size)) {
> >>>> +		int tail = r->consumer_tail;
> >>>> +
> >>>> +		if (unlikely(consumer_head >= r->size)) {
> >>>> +			r->consumer_tail = 0;
> >>>> +			WRITE_ONCE(r->consumer_head, 0);
> >>>> +		} else {
> >>>> +			r->consumer_tail = consumer_head;
> >>>> +			WRITE_ONCE(r->consumer_head, consumer_head);
> >>>> +		}
> >>>> +
> >>>>  		/* Zero out entries in the reverse order: this way we touch the
> >>>>  		 * cache line that producer might currently be reading the last;
> >>>>  		 * producer won't make progress and touch other cache lines
> >>>>  		 * besides the first one until we write out all entries.
> >>>>  		 */
> >>>> -		while (likely(head >= r->consumer_tail))
> >>>> -			r->queue[head--] = NULL;
> >>>> -		r->consumer_tail = consumer_head;
> >>>> -	}
> >>>> -	if (unlikely(consumer_head >= r->size)) {
> >>>> -		consumer_head = 0;
> >>>> -		r->consumer_tail = 0;
> >>>> +		while (likely(--consumer_head >= tail))
> >>>> +			r->queue[consumer_head] = NULL;
> >>>> +
> >>>> +		return;
> >>>
> >>>
> >>> So if now we need this to be reliable then
> >>> we also need smp_wmb before writing r->queue[consumer_head],
> >>> there could be other gotchas.
> >>
> >> Yes, this patch does not make it strictly reliable.
> >> I think I could mention that in the commit log?
> > 
> > OK so it's not that it makes it more reliable - this patch simply makes
> > a possible false positive less likely while making a false negative
> > more likely. Our assumption is that a false negative is cheaper, then?
> > 
> > How do we know that it is?
> > 
> > And even if we prove the ptr_ring itself is faster now,
> > how do we know what affects callers in a better way, a
> > false positive or a false negative?
> > 
> > I would rather we worked on actually making it reliable
> > e.g. if we can guarantee no false positives, that would be
> > a net win.
> I thought more about the case you mentioned above; it
> seems for the above to happen, the consumer_head needs to
> be rolled back to zero and incremented to the point when
> caller of __ptr_ring_empty() is still *not* able to see the
> r->queue[] which has been set to NULL in __ptr_ring_discard_one().
> 
> It seems smp_wmb() only need to be done once when consumer_head
> is rolled back to zero, and maybe that is enough to make sure the
> case you mentioned is fixed too?
> 
> And the smp_wmb() is only done once in a round of producing/
> consuming, so the performance impact should be minimized? (Of
> course we need to test it too.)


Sorry I don't really understand the question here.
I think I agree it's enough to do one smp_wmb between
the write of r->queue and write of consumer_head
to help guarantee no false positives.
What other code changes are necessary I can't yet say
without a deeper code review.

> > 
> >>
> >>>
> >>>>  	}
> >>>> +
> >>>>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> >>>>  	WRITE_ONCE(r->consumer_head, consumer_head);
> >>>>  }
> >>>> -- 
> >>>> 2.7.4
> >>>
> >>>
> >>> .
> >>>
> > 
> > 
> > .
> >
Michael S. Tsirkin June 27, 2021, 6:07 a.m. UTC | #8
On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 14:39, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >> Currently r->queue[] is cleared after r->consumer_head is moved
> >> forward, which makes the __ptr_ring_empty() checking called in
> >> page_pool_refill_alloc_cache() unreliable if the checking is done
> >> after the r->queue clearing and before the consumer_head moving
> >> forward.
> >>
> >> Move the r->queue[] clearing after consumer_head moving forward
> >> to make __ptr_ring_empty() checking more reliable.
> >>
> >> As a side effect of above change, a consumer_head checking is
> >> avoided for the likely case, and it has noticeable performance
> >> improvement when it is tested using the ptr_ring_test selftest
> >> added in the previous patch.
> >>
> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >> to test the case of single thread doing both the enqueuing and
> >> dequeuing:
> >>
> >>  arch     unpatched           patched       delta
> >> arm64      4648 ms            4464 ms       +3.9%
> >>  X86       2562 ms            2401 ms       +6.2%
> >>
> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >> to test the case of one thread doing enqueuing and another thread
> >> doing dequeuing concurrently, also known as single-producer/single-
> >> consumer:
> >>
> >>  arch      unpatched             patched         delta
> >> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
> >>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> > 
> > Nice but it's small - could be a fluke.
> > How many tests did you run? What is the variance?
> > Did you try pinning to different CPUs to observe numa effects?
> > Please use perf or some other modern tool for this kind
> > of benchmark. Thanks!
> 
> The result is quite stable; retesting with perf stat:

How stable exactly?  Try with -r so we can find out.

> [ duplicate perf stat output trimmed; the full numbers are quoted earlier in the thread ]
Yunsheng Lin June 28, 2021, 2:11 a.m. UTC | #9
On 2021/6/27 14:07, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
>> On 2021/6/25 14:39, Michael S. Tsirkin wrote:
>>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>>>> Currently r->queue[] is cleared after r->consumer_head is moved
>>>> forward, which makes the __ptr_ring_empty() checking called in
>>>> page_pool_refill_alloc_cache() unreliable if the checking is done
>>>> after the r->queue clearing and before the consumer_head moving
>>>> forward.
>>>>
>>>> Move the r->queue[] clearing after consumer_head moving forward
>>>> to make __ptr_ring_empty() checking more reliable.
>>>>
>>>> As a side effect of above change, a consumer_head checking is
>>>> avoided for the likely case, and it has noticeable performance
>>>> improvement when it is tested using the ptr_ring_test selftest
>>>> added in the previous patch.
>>>>
>>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>>>> to test the case of single thread doing both the enqueuing and
>>>> dequeuing:
>>>>
>>>>  arch     unpatched           patched       delta
>>>> arm64      4648 ms            4464 ms       +3.9%
>>>>  X86       2562 ms            2401 ms       +6.2%
>>>>
>>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>>>> to test the case of one thread doing enqueuing and another thread
>>>> doing dequeuing concurrently, also known as single-producer/single-
>>>> consumer:
>>>>
>>>>  arch      unpatched             patched         delta
>>>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
>>>
>>> Nice but it's small - could be a fluke.
>>> How many tests did you run? What is the variance?
>>> Did you try pinning to different CPUs to observe numa effects?
>>> Please use perf or some other modern tool for this kind
>>> of benchmark. Thanks!
>>
>> The result is quite stable; retesting with perf stat:
> 
> How stable exactly?  Try with -r so we can find out.

Retest with "perf stat -r":

For unpatched one:
Performance counter stats for './ptr_ring_test -s 1000 -m 1 -N 100000000' (100 runs):

           6780.97 msec task-clock                #    2.000 CPUs utilized            ( +-  5.36% )
                73      context-switches          #    0.011 K/sec                    ( +-  5.07% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
                81      page-faults               #    0.012 K/sec                    ( +-  0.76% )
       17629544748      cycles                    #    2.600 GHz                      ( +-  5.36% )
       25496488950      instructions              #    1.45  insn per cycle           ( +-  0.26% )
   <not supported>      branches
          11489031      branch-misses                                                 ( +-  1.69% )

             3.391 +- 0.182 seconds time elapsed  ( +-  5.35% )

For patched one:
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 1 -N 100000000' (100 runs):

           6567.83 msec task-clock                #    2.000 CPUs utilized            ( +-  5.53% )
                71      context-switches          #    0.011 K/sec                    ( +-  5.26% )
                 0      cpu-migrations            #    0.000 K/sec
                82      page-faults               #    0.012 K/sec                    ( +-  0.85% )
       17075489298      cycles                    #    2.600 GHz                      ( +-  5.53% )
       23861051578      instructions              #    1.40  insn per cycle           ( +-  0.07% )
   <not supported>      branches
          10473776      branch-misses                                                 ( +-  0.60% )

             3.284 +- 0.182 seconds time elapsed  ( +-  5.53% )


The result is more stable when using taskset to pin the CPUs, but I suppose
the above data is stable enough to justify the performance improvement?
Yunsheng Lin June 28, 2021, 2:17 a.m. UTC | #10
On 2021/6/27 14:03, Michael S. Tsirkin wrote:
>>>>>
>>>>>
>>>>> So if now we need this to be reliable then
>>>>> we also need smp_wmb before writing r->queue[consumer_head],
>>>>> there could be other gotchas.
>>>>
>>>> Yes, this patch does not make it strictly reliable.
>>>> I think I could mention that in the commit log?
>>>
>>> OK so it's not that it makes it more reliable - this patch simply makes
>>> a possible false positive less likely while making a false negative
>>> more likely. Our assumption is that a false negative is cheaper, then?
>>>
>>> How do we know that it is?
>>>
>>> And even if we prove the ptr_ring itself is faster now,
>>> how do we know what affects callers in a better way, a
>>> false positive or a false negative?
>>>
>>> I would rather we worked on actually making it reliable
>>> e.g. if we can guarantee no false positives, that would be
>>> a net win.
>> I thought more about the case you mentioned above; it
>> seems for the above to happen, the consumer_head needs to
>> be rolled back to zero and incremented to the point when
>> caller of __ptr_ring_empty() is still *not* able to see the
>> r->queue[] which has been set to NULL in __ptr_ring_discard_one().
>>
>> It seems smp_wmb() only need to be done once when consumer_head
>> is rolled back to zero, and maybe that is enough to make sure the
>> case you mentioned is fixed too?
>>
>> And the smp_wmb() is only done once in a round of producing/
>> consuming, so the performance impact should be minimized? (Of
>> course we need to test it too.)
> 
> 
> Sorry I don't really understand the question here.
> I think I agree it's enough to do one smp_wmb between
> the write of r->queue and write of consumer_head
> to help guarantee no false positives.
> What other code changes are necessary I can't yet say
> without a deeper code review.
> 

OK, thanks for the review.
Will add handling for the case you mentioned above in V3 if there
is no noticeable performance impact from handling it.
diff mbox series

Patch

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 808f9d3..db9c282 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -261,8 +261,7 @@  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
 	 * to work correctly.
 	 */
-	int consumer_head = r->consumer_head;
-	int head = consumer_head++;
+	int consumer_head = r->consumer_head + 1;
 
 	/* Once we have processed enough entries invalidate them in
 	 * the ring all at once so producer can reuse their space in the ring.
@@ -271,19 +270,27 @@  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 	 */
 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
 		     consumer_head >= r->size)) {
+		int tail = r->consumer_tail;
+
+		if (unlikely(consumer_head >= r->size)) {
+			r->consumer_tail = 0;
+			WRITE_ONCE(r->consumer_head, 0);
+		} else {
+			r->consumer_tail = consumer_head;
+			WRITE_ONCE(r->consumer_head, consumer_head);
+		}
+
 		/* Zero out entries in the reverse order: this way we touch the
 		 * cache line that producer might currently be reading the last;
 		 * producer won't make progress and touch other cache lines
 		 * besides the first one until we write out all entries.
 		 */
-		while (likely(head >= r->consumer_tail))
-			r->queue[head--] = NULL;
-		r->consumer_tail = consumer_head;
-	}
-	if (unlikely(consumer_head >= r->size)) {
-		consumer_head = 0;
-		r->consumer_tail = 0;
+		while (likely(--consumer_head >= tail))
+			r->queue[consumer_head] = NULL;
+
+		return;
 	}
+
 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
 	WRITE_ONCE(r->consumer_head, consumer_head);
 }