
[rdma-core,07/14] mlx4: Update to use new udma write barriers

Message ID 4dcf0cea-3652-0df2-9d98-74e258e6170a@dev.mellanox.co.il (mailing list archive)
State Superseded

Commit Message

Yishai Hadas March 9, 2017, 3:42 p.m. UTC
On 3/8/2017 11:56 PM, Jason Gunthorpe wrote:
> On Wed, Mar 08, 2017 at 11:27:51PM +0200, Yishai Hadas wrote:
>
>> Because of that, any command that needs the lock must be done before the
>> flush, which delays the hardware from seeing the BF data immediately.
>
> The counterpoint is that the unlock macro can combine the WC flushing
> barrier with the spinlock atomics, reducing the amount of global
> fencing. If you remove the macro you remove that optimization.

The optimization is done as part of mmio_wc_spinlock() for x86; this
macro is still used.

>
> Why not do this:
>
> -               mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> -                            align(size * 16, 64));
> -
> +               tmp_bf_offset = ctx->bf_offset;
>                 ctx->bf_offset ^= ctx->bf_buf_size;

The above two statements still delay the write to the NIC compared to the
original code, where it was done in a single statement after mlx4_bf_copy().

> +               mlx4_bf_copy(ctx->bf_page + tmp_bf_offset, (unsigned long *) ctrl,
> +                            align(size * 16, 64));
>
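For reference, the proposed ordering above would make the BlueFlame branch
read roughly as follows (an illustrative sketch only, assuming the
surrounding mlx4 post-send context; 'tmp_bf_offset' is a new local):

                /* Save the current offset, toggle the shared offset while
                 * still under the lock, then do the WC copy from the saved
                 * value.  The WC copy starts slightly later, in exchange
                 * for one fewer load on the path (the trade-off debated
                 * below). */
                unsigned tmp_bf_offset = ctx->bf_offset;

                ctx->bf_offset ^= ctx->bf_buf_size;

                mlx4_bf_copy(ctx->bf_page + tmp_bf_offset,
                             (unsigned long *) ctrl, align(size * 16, 64));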

The candidate mlx4 code will be as follows; similar logic will apply in mlx5.

@@ -477,22 +474,18 @@ out:
                 ctrl->owner_opcode |= htonl((qp->sq.head & 0xffff) << 8);

                 ctrl->bf_qpn |= qp->doorbell_qpn;
+               ++qp->sq.head;
                 /*
                  * Make sure that descriptor is written to memory
                  * before writing to BlueFlame page.
                  */
-               mmio_wc_start();
-
-               ++qp->sq.head;
-
-               pthread_spin_lock(&ctx->bf_lock);
+               mmio_wc_spinlock(&ctx->bf_lock);

                 mlx4_bf_copy(ctx->bf_page + ctx->bf_offset,
                              (unsigned long *) ctrl, align(size * 16, 64));

                 mmio_flush_writes();
                 ctx->bf_offset ^= ctx->bf_buf_size;
                 pthread_spin_unlock(&ctx->bf_lock);
         } else if (nreq) {
                 qp->sq.head += nreq;

@@ -222,4 +224,17 @@
  */
  #define mmio_ordered_writes_hack() mmio_flush_writes()

+/* Higher Level primitives */
+
+/* Do mmio_wc_start and grab a spinlock */
+static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
+{
+       pthread_spin_lock(lock);
+#if !defined(__i386__) && !defined(__x86_64__)
+       /* For x86 the serialization within the spin lock is enough to
+        * strongly order WC and other memory types. */
+       mmio_wc_start();
+#endif
+}
+

Comments

Jason Gunthorpe March 9, 2017, 5:03 p.m. UTC | #1
On Thu, Mar 09, 2017 at 05:42:19PM +0200, Yishai Hadas wrote:

> >The counterpoint is that the unlock macro can combine the WC flushing
> >barrier with the spinlock atomics, reducing the amount of global
> >fencing. If you remove the macro you remove that optimization.
> 
> The optimization is done as part of mmio_wc_spinlock() for x86; this macro
> is still used.

I'm talking about optimizing the unlock too.

The x86 definition probably requires the unlock to flush the WC
buffers just like for the lock, so avoiding the unlocking SFENCE
entirely will increase throughput further, but with a bit more latency
till flush.
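As an illustrative sketch only (this helper is not part of the patch under
review), folding the flush into the unlock as described could look like the
following; whether the x86 path can really drop the explicit SFENCE is the
open question raised later in the thread:

    static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
    {
    #if !defined(__i386__) && !defined(__x86_64__)
            /* Non-x86 still needs an explicit flush of the WC data before
             * the lock is released.  On x86 the WC buffer is left to drain
             * on its own: better throughput, but more latency until the
             * BlueFlame write reaches the device. */
            mmio_flush_writes();
    #endif
            pthread_spin_unlock(lock);
    }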

> >Why not do this:
> >
> >-               mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> >-                            align(size * 16, 64));
> >-
> >+               tmp_bf_offset = ctx->bf_offset;
> >                ctx->bf_offset ^= ctx->bf_buf_size;
> 
> The above two statements still delay the write to the NIC compared to the
> original code, where it was done in a single statement after mlx4_bf_copy().

It is simple enough, look at the assembly: my version eliminates an extra
load, for instance. So we get one 'cycle' better on throughput and
one cycle worse on latency to the SFENCE.

But this routine was obviously never written to optimize latency to the
SFENCE, e.g. why isn't '++qp->sq.head;' pushed to after the SFENCE if
that is so important? Then again, pushing it after would improve
latency and hurt throughput.

.. and if you are so concerned about latency to SFENCE you should be
super excited about the barrier changes I sent you, as those will
improve that by a few cycles also.

I honestly think you are trying far too much to pointlessly preserve
the exact original code...

If you want to wreck the API like this, I would rather do it supported
by actual numbers. Add some TSC measurements in this area and see what
the different scenarios cost.
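For example, a minimal TSC harness of the kind being suggested could wrap
the doorbell section from the candidate code above (hypothetical sketch;
'total_cycles' is an illustrative accumulator and __rdtsc() needs
<x86intrin.h> on x86):

                uint64_t start = __rdtsc();

                mmio_wc_spinlock(&ctx->bf_lock);

                mlx4_bf_copy(ctx->bf_page + ctx->bf_offset,
                             (unsigned long *) ctrl, align(size * 16, 64));

                mmio_flush_writes();
                ctx->bf_offset ^= ctx->bf_buf_size;
                pthread_spin_unlock(&ctx->bf_lock);

                total_cycles += __rdtsc() - start;  /* compare per-variant totals */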

IMHO this stuff is already hard enough; having a simple-to-use,
symmetric API is more valuable.

> +/* Higher Level primitives */

If you are going to send this as a patch, please include my updated
comment.

> +/* Do mmio_wc_start and grab a spinlock */
> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
> +{
> +       pthread_spin_lock(lock);
> +#if !defined(__i386__) && !defined(__x86_64__)
> +       /* For x86 the serialization within the spin lock is enough to
> +        * strongly order WC and other memory types. */
> +       mmio_wc_start();
> +#endif

I would like to see the unlock inline still present in the header for
clarity to the reader what the expected pattern is, and a comment in
mlx4/mlx5 indicating they are not using the unlock macro directly to
try and reduce latency to the flush.
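In other words, the symmetric pattern being asked for would read roughly
like this in a provider (hypothetical sketch using the unlock helper
sketched earlier; mlx4/mlx5 instead open-code mmio_flush_writes() before
the unlock so the WC data is pushed out as early as possible):

                mmio_wc_spinlock(&ctx->bf_lock);    /* take lock, order prior stores vs. WC */

                mlx4_bf_copy(ctx->bf_page + ctx->bf_offset,
                             (unsigned long *) ctrl, align(size * 16, 64));

                ctx->bf_offset ^= ctx->bf_buf_size;
                mmio_wc_spinunlock(&ctx->bf_lock);  /* flush the WC copy, then drop the lock */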

Jason
Yishai Hadas March 13, 2017, 3:17 p.m. UTC | #2
On 3/9/2017 7:03 PM, Jason Gunthorpe wrote:
> I honestly think you are trying far too much to pointlessly preserve
> the exact original code...

At this stage we would like to solve the degradation that was introduced
by the previous barrier series. Further improvements will come as
incremental patches after the required performance testing has been done.

> If you are going to send this as a patch, please include my updated
> comment.

I put a comment in mlx4/5 that points out that the code is latency
oriented and flushes immediately.

>> +/* Do mmio_wc_start and grab a spinlock */
>> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
>> +{
>> +       pthread_spin_lock(lock);
>> +#if !defined(__i386__) && !defined(__x86_64__)
>> +       /* For x86 the serialization within the spin lock is enough to
>> +        * strongly order WC and other memory types. */
>> +       mmio_wc_start();
>> +#endif
>
> I would like to see the unlock inline still present in the header for
> clarity to the reader what the expected pattern is, and a comment in
> mlx4/mlx5 indicating they are not using the unlock macro directly to
> try and reduce latency to the flush.

For now this macro is not used by mlx4 & mlx5, as noted before.

In addition, it still needs some work to verify whether the unlock is
fully equivalent to flushing the data immediately, as you also wondered.
The lock macro is used for ordering, and as such can be relied on based
on the Intel docs that were referenced.

If some provider turns out to need this optimization, it can be added
after finalizing the above work.


Patch

diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index 9e73148..ec14dd3 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -33,6 +33,8 @@ 
  #ifndef __UTIL_UDMA_BARRIER_H
  #define __UTIL_UDMA_BARRIER_H

+#include <pthread.h>
+
  /* Barriers for DMA.

     These barriers are explicitly only for use with user DMA