
[rdma-core,07/14] mlx4: Update to use new udma write barriers

Message ID 20170306173139.GA11805@obsidianresearch.com (mailing list archive)
State Superseded

Commit Message

Jason Gunthorpe March 6, 2017, 5:31 p.m. UTC
On Mon, Mar 06, 2017 at 04:57:40PM +0200, Yishai Hadas wrote:

> >Since there is no cost on x86-64 to this barrier I would like to leave
> >it here as it lets us actually optimize the ARM and other barriers. If
> >you take it out then 'udma_ordering_write_barrier' is forced to be the
> >strongest barrier on all arches.
> 
> Until we make the further optimizations, we suspect a performance
> degradation on arches other than x86, as this patch introduces an extra
> barrier that wasn't there before (i.e. udma_to_device_barrier()).

Yes, possibly.

The only other option I see is to change those couple of call sites in
mlx4 to be udma_to_device_barrier() - which loses the information that
they are actually doing something different.

Honestly, I think if someone cares about the other arches they will
see a net win if the proper weak barrier is implemented for
udma_ordering_write_barrier().
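
Just to illustrate the direction (this is only a sketch, not a proposal
for the exact instruction choices on each architecture), a weaker
per-arch definition could look something like:

/* Illustrative sketch only: an ordering barrier for system-memory
 * writes that is allowed to be weaker than udma_to_device_barrier(). */
#if defined(__i386__) || defined(__x86_64__)
/* x86 stores to cacheable memory are already ordered against each
 * other, so a compiler barrier is enough. */
#define udma_ordering_write_barrier() asm volatile("" ::: "memory")
#elif defined(__aarch64__)
/* Store-store barrier, cheaper than a full dsb. */
#define udma_ordering_write_barrier() asm volatile("dmb oshst" ::: "memory")
#else
/* Fall back to the strongest barrier. */
#define udma_ordering_write_barrier() udma_to_device_barrier()
#endif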

> >Even on x86, it is very questionable to not have the SFENCE in that
> >spot. AFAIK it is not defined to be strongly ordered.
> >
> >mlx5 has the SFENCE here, for instance.
> 
> We ran some performance tests with the above change; initial results
> point to a degradation of about 3% in message rate in the above BlueFlame
> path on x86, which is something we should prevent.
> 
> Based on the analysis below, it looks like the change to use
> 'mmio_wc_start()', which maps to SFENCE on x86, is redundant.

Okay, I think your analysis makes sense, and extending it broadly
means there is a fair amount of over-barriering on x86 in various
places.

For now, there are several places where this WC spinlock pattern is
used, so let us pull it out into helpers. Here is an example; there are
still other places in the mlx drivers.

What do you think about this approach? Notice that it allows us to see
that mlx5 should be optimized to elide the leading SFENCE as
well. This should speed up mlx5 compared to the original.
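
To spell out what the helpers are wrapping, the pattern at each of
these call sites is roughly the following (a sketch with placeholder
names wc_lock/wc_page/wc_offset/desc, not code from any driver):

	/* 1. order prior descriptor writes against the WC copy,
	 * 2. serialize writers to the WC doorbell page,
	 * 3. flush the WC buffer so the doorbell goes out promptly. */
	mmio_wc_spinlock(&wc_lock);               /* lock + leading WC barrier */
	memcpy(wc_page + wc_offset, desc, size);  /* copy to the WC BAR page   */
	mmio_wc_spinunlock(&wc_lock);             /* trailing flush + unlock   */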


Comments

Yishai Hadas March 7, 2017, 4:44 p.m. UTC | #1
On 3/6/2017 7:31 PM, Jason Gunthorpe wrote:
> On Mon, Mar 06, 2017 at 04:57:40PM +0200, Yishai Hadas wrote:
>
>>> Since there is no cost on x86-64 to this barrier I would like to leave
>>> it here as it lets us actually optimize the ARM and other barriers. If
>>> you take it out then 'udma_ordering_write_barrier' is forced to be the
>>> strongest barrier on all arches.
>>
>> Until we make the further optimizations, we suspect a performance
>> degradation on arches other than x86, as this patch introduces an extra
>> barrier that wasn't there before (i.e. udma_to_device_barrier()).
>
> Yes, possibly.
>
> The only other option I see is to change those couple of call sites in
> mlx4 to be udma_to_device_barrier() - which loses the information that
> they are actually doing something different.
>
> Honestly, I think if someone cares about the other arches they will
> see a net win if the proper weak barrier is implemented for
> udma_ordering_write_barrier().

We can't allow any temporary degradation while relying on some future
improvements; it must all come together and be justified by performance
testing.

I'll send a patch that drops the leading udma_to_device_barrier() and
replaces udma_ordering_write_barrier() with udma_to_device_barrier();
this will be done as part of the other change that is expected here,
see below.
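
Roughly along these lines (a hypothetical fragment with placeholder
types and fields, just to show the direction, not the actual mlx4
code):

#include <stdint.h>
/* plus util/udma_barrier.h for udma_to_device_barrier() */

struct fake_wqe {
	uint64_t data;
	uint32_t owner_opcode;
};

static inline void post_fake_wqe(struct fake_wqe *wqe, uint64_t data,
				 uint32_t owner)
{
	/* the leading udma_to_device_barrier() goes away */
	wqe->data = data;		/* body of the descriptor */

	/* was udma_ordering_write_barrier(); now the one full barrier
	 * before handing ownership to the HCA */
	udma_to_device_barrier();

	wqe->owner_opcode = owner;	/* device may consume the WQE now */
}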

>>> Even on x86, it is very questionable to not have the SFENCE in that
>>> spot. AFAIK it is not defined to be strongly ordered.
>>>
>>> mlx5 has the SFENCE here, for instance.
>>
>> We ran some performance tests with the above change; initial results
>> point to a degradation of about 3% in message rate in the above BlueFlame
>> path on x86, which is something we should prevent.
>>
>> Based on the analysis below, it looks like the change to use
>> 'mmio_wc_start()', which maps to SFENCE on x86, is redundant.
>
> Okay, I think your analysis makes sense, and extending it broadly
> means there is a fair amount of over-barriering on x86 in various
> places.
>

Correct, that should be the way to go.

> What do you think about this approach? Notice that it allows us to see
> that mlx5 should be optimized to elide the leading SFENCE as
> well. This should speed up mlx5 compared to the original.

The patch below makes sense; however, it needs to be fixed in a few
points, see below. I'll fix it and run it through our in-house
regression and performance systems; once approved, I'll send it
upstream.

> diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
> index 77a4a34576cb69..32f0b3fe78fe7c 100644
> --- a/providers/mlx4/qp.c
> +++ b/providers/mlx4/qp.c
> @@ -481,15 +481,14 @@ out:
>  		 * Make sure that descriptor is written to memory
>  		 * before writing to BlueFlame page.
>  		 */
> -		mmio_wc_start();
> +		mmio_wc_spinlock(&ctx->bf_lock);
>
>  		++qp->sq.head;

This was originally not under the BF spinlock; it looks like it should
stay outside of that lock.

> -		pthread_spin_lock(&ctx->bf_lock);
> -
>  		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
>  			     align(size * 16, 64));
> -		mmio_flush_writes();
> +
> +		mmio_wc_spinunlock(&ctx->bf_lock);

We should still be under the spinlock here, see the note below; we only
expect an mmio_flush_writes() at this point, so this macro is not
needed at all.
>
>  		ctx->bf_offset ^= ctx->bf_buf_size;

You missed the next line here, which does a second pthread_spin_unlock();
in addition, this line needs to be under bf_lock.
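
Putting these comments together, the mlx4 section should end up looking
roughly like this (only a sketch of the direction; the reworked patch
will be the authoritative version):

		++qp->sq.head;			/* out of the BF lock */

		mmio_wc_spinlock(&ctx->bf_lock);

		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
			     align(size * 16, 64));
		mmio_flush_writes();		/* plain flush, under the lock */

		ctx->bf_offset ^= ctx->bf_buf_size;	/* still under bf_lock */

		pthread_spin_unlock(&ctx->bf_lock);	/* the single unlock */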
>
> diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
> index d7087d986ce79f..6187b85219dacc 100644
> --- a/providers/mlx5/qp.c
> +++ b/providers/mlx5/qp.c
> @@ -931,11 +931,11 @@ out:
>
>  		/* Make sure that the doorbell write happens before the memcpy
>  		 * to WC memory below */
> -		mmio_wc_start();
> -
>  		ctx = to_mctx(ibqp->context);
> -		if (bf->need_lock)
> -			mlx5_spin_lock(&bf->lock);
> +		if (bf->need_lock && !mlx5_single_threaded)
> +			mmio_wc_spinlock(&bf->lock->lock);

Should be &bf->lock.lock to compile, here and below.

> +		else
> +			mmio_wc_start();
>
>  		if (!ctx->shut_up_bf && nreq == 1 && bf->uuarn &&
>  		    (inl || ctx->prefer_bf) && size > 1 &&
> @@ -955,10 +955,11 @@ out:
>  		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
>  		 * doorbell 2, followed by doorbell 1.
>  		 */
> -		mmio_flush_writes();
>  		bf->offset ^= bf->buf_size;
> -		if (bf->need_lock)
> -			mlx5_spin_unlock(&bf->lock);
> +		if (bf->need_lock && !mlx5_single_threaded)
> +			mmio_wc_spinunlock(&bf->lock->lock);
> +		else
> +			mmio_flush_writes();
>  	}
>
>  	mlx5_spin_unlock(&qp->sq.lock);
> diff --git a/util/udma_barrier.h b/util/udma_barrier.h
> index 9e73148af8d5b6..ea4904d28f6a48 100644
> --- a/util/udma_barrier.h
> +++ b/util/udma_barrier.h
> @@ -33,6 +33,8 @@
>  #ifndef __UTIL_UDMA_BARRIER_H
>  #define __UTIL_UDMA_BARRIER_H
>
> +#include <pthread.h>
> +
>  /* Barriers for DMA.
>
>     These barriers are expliclty only for use with user DMA operations. If you
> @@ -222,4 +224,26 @@
>  */
>  #define mmio_ordered_writes_hack() mmio_flush_writes()
>
> +/* Higher Level primitives */
> +
> +/* Do mmio_wc_start and grab a spinlock */
> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
> +{
> +	pthread_spin_lock(lock);
> +#if !defined(__i386__) && !defined(__x86_64__)
> +	/* For x86 the serialization within the spin lock is enough to
> +	 * strongly order WC and other memory types. */
> +	mmio_wc_start();
> +#endif
> +}
> +
> +static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
> +{
> +	/* On x86 the lock is enough for strong ordering, but the SFENCE
> +	 * encourages the WC buffers to flush out more quickly (Yishai:
> +	 * confirm?) */

This macro can't do both and should be dropped, see above.

> +	mmio_flush_writes();
> +	pthread_spin_unlock(lock);
> +}
> +
>  #endif
>


Patch

diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
index 77a4a34576cb69..32f0b3fe78fe7c 100644
--- a/providers/mlx4/qp.c
+++ b/providers/mlx4/qp.c
@@ -481,15 +481,14 @@  out:
 		 * Make sure that descriptor is written to memory
 		 * before writing to BlueFlame page.
 		 */
-		mmio_wc_start();
+		mmio_wc_spinlock(&ctx->bf_lock);
 
 		++qp->sq.head;
 
-		pthread_spin_lock(&ctx->bf_lock);
-
 		mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
 			     align(size * 16, 64));
-		mmio_flush_writes();
+
+		mmio_wc_spinunlock(&ctx->bf_lock);
 
 		ctx->bf_offset ^= ctx->bf_buf_size;
 
diff --git a/providers/mlx5/qp.c b/providers/mlx5/qp.c
index d7087d986ce79f..6187b85219dacc 100644
--- a/providers/mlx5/qp.c
+++ b/providers/mlx5/qp.c
@@ -931,11 +931,11 @@  out:
 
 		/* Make sure that the doorbell write happens before the memcpy
 		 * to WC memory below */
-		mmio_wc_start();
-
 		ctx = to_mctx(ibqp->context);
-		if (bf->need_lock)
-			mlx5_spin_lock(&bf->lock);
+		if (bf->need_lock && !mlx5_single_threaded)
+			mmio_wc_spinlock(&bf->lock->lock);
+		else
+			mmio_wc_start();
 
 		if (!ctx->shut_up_bf && nreq == 1 && bf->uuarn &&
 		    (inl || ctx->prefer_bf) && size > 1 &&
@@ -955,10 +955,11 @@  out:
 		 * the mmio_flush_writes is CPU local, this will result in the HCA seeing
 		 * doorbell 2, followed by doorbell 1.
 		 */
-		mmio_flush_writes();
 		bf->offset ^= bf->buf_size;
-		if (bf->need_lock)
-			mlx5_spin_unlock(&bf->lock);
+		if (bf->need_lock && !mlx5_single_threaded)
+			mmio_wc_spinunlock(&bf->lock->lock);
+		else
+			mmio_flush_writes();
 	}
 
 	mlx5_spin_unlock(&qp->sq.lock);
diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index 9e73148af8d5b6..ea4904d28f6a48 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -33,6 +33,8 @@ 
 #ifndef __UTIL_UDMA_BARRIER_H
 #define __UTIL_UDMA_BARRIER_H
 
+#include <pthread.h>
+
 /* Barriers for DMA.
 
    These barriers are expliclty only for use with user DMA operations. If you
@@ -222,4 +224,26 @@ 
 */
 #define mmio_ordered_writes_hack() mmio_flush_writes()
 
+/* Higher Level primitives */
+
+/* Do mmio_wc_start and grab a spinlock */
+static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
+{
+	pthread_spin_lock(lock);
+#if !defined(__i386__) && !defined(__x86_64__)
+	/* For x86 the serialization within the spin lock is enough to
+	 * strongly order WC and other memory types. */
+	mmio_wc_start();
+#endif
+}
+
+static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
+{
+	/* On x86 the lock is enough for strong ordering, but the SFENCE
+	 * encourages the WC buffers to flush out more quickly (Yishai:
+	 * confirm?) */
+	mmio_flush_writes();
+	pthread_spin_unlock(lock);
+}
+
 #endif