Message ID | 4dcf0cea-3652-0df2-9d98-74e258e6170a@dev.mellanox.co.il (mailing list archive) |
---|---|
State | Superseded |
On Thu, Mar 09, 2017 at 05:42:19PM +0200, Yishai Hadas wrote:
> > The counter point is that the unlock macro can combine the WC flushing
> > barrier with the spinlock atomics, reducing the amount of global
> > fencing. If you remove the macro you remove that optimization.
>
> The optimization is done as part of mmio_wc_spinlock() for X86, this
> macro is still used.

I'm talking about optimizing the unlock too. The x86 definition
probably requires the unlock to flush the WC buffers just like for the
lock, so avoiding the unlocking SFENCE entirely will increase
throughput further, but with a bit more latency till the flush.

> > Why not do this:
> >
> > -	mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> > -		     align(size * 16, 64));
> > -
> > +	tmp_bf_offset = ctx->bf_offset;
> >  	ctx->bf_offset ^= ctx->bf_buf_size;
>
> The above 2 commands are still delaying the writing to the NIC compared
> to the original code, where it was done in one command after
> mlx4_bf_copy().

It is so simple: look at the assembly; my version eliminates an extra
load, for instance. So we get one 'cycle' better on throughput and one
cycle worse on latency to the SFENCE.

But.. this routine was obviously never written to optimize latency to
the SFENCE, eg why isn't '++qp->sq.head;' pushed to after the SFENCE if
that is so important? But again, pushing it after will improve latency
and hurt throughput.

.. and if you are so concerned about latency to the SFENCE you should
be super excited about the barrier changes I sent you, as those will
improve that by a few cycles also.

I honestly think you are trying far too hard to pointlessly preserve
the exact original code...

If you want to wreck the API like this, I would rather do it supported
by actual numbers. Add some TSC measurements in this area and see what
the different scenarios cost.

IMHO this stuff is already hard enough; having a simple to use,
symmetric API is more valuable.

> +/* Higher Level primitives */

If you are going to send this as a patch please include my updated
comment.

> +/* Do mmio_wc_start and grab a spinlock */
> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
> +{
> +	pthread_spin_lock(lock);
> +#if !defined(__i386__) && !defined(__x86_64__)
> +	/* For x86 the serialization within the spin lock is enough to
> +	 * strongly order WC and other memory types. */
> +	mmio_wc_start();
> +#endif

I would like to see the unlock inline still present in the header for
clarity to the reader about what the expected pattern is, and a comment
in mlx4/mlx5 indicating they are not using the unlock macro directly to
try and reduce latency to the flush.

Jason
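
[Editor's note] A minimal sketch of the unlock-side helper Jason asks to keep in the header. It assumes mmio_flush_writes() is the WC flushing barrier from the same header (SFENCE on x86); the name mmio_wc_spinunlock() and the comment wording are illustrative, not the committed code.

```c
/* Sketch only: flush the WC writes and release the spinlock, the
 * symmetric counterpart to mmio_wc_spinlock(). */
static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
{
	/* Avoiding this SFENCE entirely, as discussed in the thread,
	 * would gain throughput at the cost of more latency until the
	 * WC buffers drain on their own. */
	mmio_flush_writes();
	pthread_spin_unlock(lock);
}
```
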
On 3/9/2017 7:03 PM, Jason Gunthorpe wrote:
> I honestly think you are trying far too hard to pointlessly preserve
> the exact original code...

At this stage we would like to solve the degradation that was
introduced in the previous series of the barriers. Further improvements
will come as incremental patches after the required performance testing
is done.

> If you are going to send this as a patch please include my updated
> comment.

I put a comment in mlx4/5 pointing out that the code is latency
oriented and flushes immediately.

>> +/* Do mmio_wc_start and grab a spinlock */
>> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
>> +{
>> +	pthread_spin_lock(lock);
>> +#if !defined(__i386__) && !defined(__x86_64__)
>> +	/* For x86 the serialization within the spin lock is enough to
>> +	 * strongly order WC and other memory types. */
>> +	mmio_wc_start();
>> +#endif
>
> I would like to see the unlock inline still present in the header for
> clarity to the reader about what the expected pattern is, and a comment
> in mlx4/mlx5 indicating they are not using the unlock macro directly to
> try and reduce latency to the flush.

For now this macro is not used by mlx4 & mlx5, as pointed out before.
In addition, it still needs some work to verify whether the unlock is
fully equal to flushing the data immediately, as you also wondered. The
lock macro is used for ordering and as such can be used based on the
Intel docs that were pointed to. If some provider needs this
optimization, it can be added after finalizing the above work.
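
[Editor's note] A rough sketch of the latency-oriented provider path Yishai describes, not the actual mlx4 code: struct bf_state, bf_copy() and bf_post() are placeholder names mirroring the bf_offset/bf_buf_size toggling in the quoted hunk, while mmio_wc_spinlock() and mmio_flush_writes() are assumed to come from util/udma_barrier.h. The lock macro is used only for ordering; the flush is issued right after the BlueFlame copy instead of being folded into an unlock macro.

```c
#include <pthread.h>
#include <stdint.h>

/* Placeholder state; the real fields live in the mlx4 context. */
struct bf_state {
	uint8_t *bf_page;		/* WC-mapped BlueFlame register */
	unsigned int bf_offset;
	unsigned int bf_buf_size;
	pthread_spinlock_t bf_lock;
};

static void bf_post(struct bf_state *bf, const void *ctrl, unsigned int bytecnt)
{
	/* mmio_wc_start() is folded into the lock on x86 */
	mmio_wc_spinlock(&bf->bf_lock);

	/* bf_copy() stands in for mlx4_bf_copy() */
	bf_copy(bf->bf_page + bf->bf_offset, ctrl, bytecnt);
	bf->bf_offset ^= bf->bf_buf_size;

	/* Latency oriented: flush the WC write to the NIC right away
	 * instead of relying on an unlock macro to do it. */
	mmio_flush_writes();

	pthread_spin_unlock(&bf->bf_lock);
}
```
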
diff --git a/util/udma_barrier.h b/util/udma_barrier.h
index 9e73148..ec14dd3 100644
--- a/util/udma_barrier.h
+++ b/util/udma_barrier.h
@@ -33,6 +33,8 @@
 #ifndef __UTIL_UDMA_BARRIER_H
 #define __UTIL_UDMA_BARRIER_H
 
+#include <pthread.h>
+
 /* Barriers for DMA.
 
    These barriers are explicitly only for use with user DMA
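
[Editor's note] The <pthread.h> include above is what allows udma_barrier.h to define the spinlock helper quoted earlier in the thread. For reference, the complete inline (only the closing brace was trimmed from the quotes) reads roughly:

```c
/* Do mmio_wc_start and grab a spinlock */
static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
{
	pthread_spin_lock(lock);
#if !defined(__i386__) && !defined(__x86_64__)
	/* For x86 the serialization within the spin lock is enough to
	 * strongly order WC and other memory types. */
	mmio_wc_start();
#endif
}
```
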