Message ID | 1521216991-28706-19-git-send-email-okaya@codeaurora.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
> Code includes wmb() followed by writel(). writel() already has a barrier on > some architectures like arm64. > > This ends up CPU observing two barriers back to back before executing the > register write. > > Since code already has an explicit barrier call, changing writel() to > writel_relaxed(). > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org> NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just writeX(). I was just looking at this with Chelsio developers, and they said the writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to get rid of the extra barrier for all architectures. Also, t4.h:pio_copy() needs to use __raw_writeq() to enable the write combining fastpath for ARM and PowerPC. The code as it stands doesn't achieve any write combining on PowerPC at least. And the writel()s at the end of the ring functions (the non bar2 udb path) needs a mmiowb() afterwards if you're going to use __raw_writeX() there. However that path is only used for very old hardware (T4), so I wouldn't worry about them. Steve. > --- > drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/drivers/infiniband/hw/cxgb4/t4.h > b/drivers/infiniband/hw/cxgb4/t4.h > index 8369c7c..7a48c9e 100644 > --- a/drivers/infiniband/hw/cxgb4/t4.h > +++ b/drivers/infiniband/hw/cxgb4/t4.h > @@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, > u16 inc, union t4_wr *wqe) > (u64 *)wqe); > } else { > pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx); > - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), > - wq->sq.bar2_va + SGE_UDB_KDOORBELL); > + writel_relaxed(PIDX_T5_V(inc) | QID_V(wq- > >sq.bar2_qid), > + wq->sq.bar2_va + > SGE_UDB_KDOORBELL); > } > > /* Flush user doorbell area writes. */ > wmb(); > return; > } > - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); > + writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); > } > > static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, > @@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, > u16 inc, > (void *)wqe); > } else { > pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx); > - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), > - wq->rq.bar2_va + SGE_UDB_KDOORBELL); > + writel_relaxed(PIDX_T5_V(inc) | QID_V(wq- > >rq.bar2_qid), > + wq->rq.bar2_va + > SGE_UDB_KDOORBELL); > } > > /* Flush user doorbell area writes. */ > wmb(); > return; > } > - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); > + writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); > } > > static inline int t4_wq_in_error(struct t4_wq *wq) > -- > 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 3/16/2018 5:05 PM, Steve Wise wrote: >> Code includes wmb() followed by writel(). writel() already has a barrier > on >> some architectures like arm64. >> >> This ends up CPU observing two barriers back to back before executing the >> register write. >> >> Since code already has an explicit barrier call, changing writel() to >> writel_relaxed(). >> >> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> > > NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just > writeX(). > > I was just looking at this with Chelsio developers, and they said the > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to > get rid of the extra barrier for all architectures. OK. I can do that but isn't the problem at PowerPC adaptation? /* * We don't do relaxed operations yet, at least not with this semantic */ #define readb_relaxed(addr) readb(addr) #define readw_relaxed(addr) readw(addr) #define readl_relaxed(addr) readl(addr) #define readq_relaxed(addr) readq(addr) #define writeb_relaxed(v, addr) writeb(v, addr) #define writew_relaxed(v, addr) writew(v, addr) #define writel_relaxed(v, addr) writel(v, addr) #define writeq_relaxed(v, addr) writeq(v, addr) Why don't we fix the PowerPC's relaxed operators? Is that a bigger task? From API perspective both __raw_writeX() and writeX_relaxed() are correct. It is just PowerPC doesn't seem the follow the definition yet. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote: > > Code includes wmb() followed by writel(). writel() already has a barrier > on > > some architectures like arm64. > > > > This ends up CPU observing two barriers back to back before executing the > > register write. > > > > Since code already has an explicit barrier call, changing writel() to > > writel_relaxed(). > > > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org> > > NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just > writeX(). ?? Why is changing writex() to writeX() a NAK then? > I was just looking at this with Chelsio developers, and they said the > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to > get rid of the extra barrier for all architectures. That doesn't seem semanticaly sane. __raw_writeX() should not appear in driver code, IMHO. Only the arch code can know what the exact semantics of that accessor are.. If ppc can't use writel_relaxed to optimize then we probably need yet another io accessor semantic defined :( Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote: > > > Code includes wmb() followed by writel(). writel() already has a barrier > > on > > > some architectures like arm64. > > > > > > This ends up CPU observing two barriers back to back before executing > the > > > register write. > > > > > > Since code already has an explicit barrier call, changing writel() to > > > writel_relaxed(). > > > > > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org> > > > > NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just > > writeX(). > > ?? Why is changing writex() to writeX() a NAK then? Because I want it correct for PPC as well. > > > I was just looking at this with Chelsio developers, and they said the > > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to > > get rid of the extra barrier for all architectures. > > That doesn't seem semanticaly sane. > > __raw_writeX() should not appear in driver code, IMHO. Only the arch > code can know what the exact semantics of that accessor are.. > > If ppc can't use writel_relaxed to optimize then we probably need yet > another io accessor semantic defined :( Anybody understand why the PPC implementation of writeX_relaxed() isn't relaxed? Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > On 3/16/2018 5:05 PM, Steve Wise wrote: > >> Code includes wmb() followed by writel(). writel() already has a barrier > > on > >> some architectures like arm64. > >> > >> This ends up CPU observing two barriers back to back before executing > the > >> register write. > >> > >> Since code already has an explicit barrier call, changing writel() to > >> writel_relaxed(). > >> > >> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> > > > > NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just > > writeX(). > > > > I was just looking at this with Chelsio developers, and they said the > > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to > > get rid of the extra barrier for all architectures. > > OK. I can do that but isn't the problem at PowerPC adaptation? > > /* > * We don't do relaxed operations yet, at least not with this semantic > */ > #define readb_relaxed(addr) readb(addr) > #define readw_relaxed(addr) readw(addr) > #define readl_relaxed(addr) readl(addr) > #define readq_relaxed(addr) readq(addr) > #define writeb_relaxed(v, addr) writeb(v, addr) > #define writew_relaxed(v, addr) writew(v, addr) > #define writel_relaxed(v, addr) writel(v, addr) > #define writeq_relaxed(v, addr) writeq(v, addr) > > Why don't we fix the PowerPC's relaxed operators? Is that a bigger task? I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC? > > >From API perspective both __raw_writeX() and writeX_relaxed() are > correct. > It is just PowerPC doesn't seem the follow the definition yet. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 3/16/2018 7:05 PM, Steve Wise wrote: >> >> On 3/16/2018 5:05 PM, Steve Wise wrote: >>>> Code includes wmb() followed by writel(). writel() already has a barrier >>> on >>>> some architectures like arm64. >>>> >>>> This ends up CPU observing two barriers back to back before executing >> the >>>> register write. >>>> >>>> Since code already has an explicit barrier call, changing writel() to >>>> writel_relaxed(). >>>> >>>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> >>> >>> NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just >>> writeX(). >>> >>> I was just looking at this with Chelsio developers, and they said the >>> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to >>> get rid of the extra barrier for all architectures. >> >> OK. I can do that but isn't the problem at PowerPC adaptation? >> >> /* >> * We don't do relaxed operations yet, at least not with this semantic >> */ >> #define readb_relaxed(addr) readb(addr) >> #define readw_relaxed(addr) readw(addr) >> #define readl_relaxed(addr) readl(addr) >> #define readq_relaxed(addr) readq(addr) >> #define writeb_relaxed(v, addr) writeb(v, addr) >> #define writew_relaxed(v, addr) writew(v, addr) >> #define writel_relaxed(v, addr) writel(v, addr) >> #define writeq_relaxed(v, addr) writeq(v, addr) >> >> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task? > > I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC? > I found this article: https://lwn.net/Articles/698014/#PowerPC%20Implementation Apparently, this issue was discussed at a conference in 2015. Based on how I read this article, writel() and writel_relaxed() behave exactly the same on PowerPC because the barrier is enforced by the time code is leaving a critical section not during MMIO execution. I also see that __raw_writel() is being heavily used in drivers directory. https://elixir.bootlin.com/linux/latest/ident/__raw_writel I'll change writel_relaxed() with __raw_writel() in the series like you suggested and also look at your other comments. This seems to be the current norm. > >> >> >From API perspective both __raw_writeX() and writeX_relaxed() are >> correct. >> It is just PowerPC doesn't seem the follow the definition yet. > > > >
On 3/16/2018 11:40 PM, Sinan Kaya wrote: > I'll change writel_relaxed() with __raw_writel() in the series like you suggested > and also look at your other comments. I spoke too soon. Now that I realized, code needs to follow one of the following patterns for correctness 1) wmb() writel()/writel_relaxed() or 2) wmb() __raw_wrltel() mmiowb() but definitely not wmb() __raw_wrltel() Since #1 == #2, I'll stick to my current implementation of writel_relaxed() Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb() for correctness. ARM's mmiowb() implementation is empty. So, there is no one size fits all solution with the current state of affairs.
On 3/16/18 6:04 PM, Steve Wise wrote: > Anybody understand why the PPC implementation of writeX_relaxed() isn't > relaxed? You probably should ask that on the linuxppc-dev@lists.ozlabs.org mailing list. I've always wondered why PowerPC has non-standard I/O accessors.
On 3/17/2018 12:03 AM, Sinan Kaya wrote: > On 3/16/2018 11:40 PM, Sinan Kaya wrote: >> I'll change writel_relaxed() with __raw_writel() in the series like you suggested >> and also look at your other comments. > > I spoke too soon. > > Now that I realized, code needs to follow one of the following patterns for correctness > > 1) > wmb() > writel()/writel_relaxed() > > or > > 2) > wmb() > __raw_wrltel() > mmiowb() > > but definitely not > > wmb() > __raw_wrltel() > > Since #1 == #2, I'll stick to my current implementation of writel_relaxed() > > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb() > for correctness. ARM's mmiowb() implementation is empty. > > So, there is no one size fits all solution with the current state of affairs. > > I think I finally got what you mean. Code seems to have wmb() writel()/writeq() wmb() this can be safely replaced with wmb() __raw_writel()/__raw_writeq() wmb() This will work on all arches. Below is the new version. Let me know if this is OK. +++ b/drivers/infiniband/hw/cxgb4/t4.h @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src) int count = 8; while (count) { - writeq(*src, dst); + __raw_writeq(*src, dst); src++; dst++; count--; @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe) (u64 *)wqe); } else { pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx); - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), - wq->sq.bar2_va + SGE_UDB_KDOORBELL); + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), + wq->sq.bar2_va + SGE_UDB_KDOORBELL); } /* Flush user doorbell area writes. */ wmb(); return; } - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); + __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); + mmiowmb() } static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, (void *)wqe); } else { pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx); - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), - wq->rq.bar2_va + SGE_UDB_KDOORBELL); + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), + wq->rq.bar2_va + SGE_UDB_KDOORBELL); } /* Flush user doorbell area writes. */ wmb(); return; } - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); + __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); + mmiowmb(); }
On Fri, Mar 16, 2018 at 11:25 PM, Sinan Kaya <okaya@codeaurora.org> wrote: > @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe) > (u64 *)wqe); > } else { > pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx); > - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), > - wq->sq.bar2_va + SGE_UDB_KDOORBELL); > + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), I think Jason might be right that drivers shouldn't be calling the __raw_write functions. You definitely need to run this by the PPC developers, specifically Ben Herrenschmidt.
> > On 3/17/2018 12:03 AM, Sinan Kaya wrote: > > On 3/16/2018 11:40 PM, Sinan Kaya wrote: > >> I'll change writel_relaxed() with __raw_writel() in the series like you > suggested > >> and also look at your other comments. > > > > I spoke too soon. > > > > Now that I realized, code needs to follow one of the following patterns > for correctness > > > > 1) > > wmb() > > writel()/writel_relaxed() > > > > or > > > > 2) > > wmb() > > __raw_wrltel() > > mmiowb() > > > > but definitely not > > > > wmb() > > __raw_wrltel() > > > > Since #1 == #2, I'll stick to my current implementation of writel_relaxed() > > > > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. > PowerPC needs mmiowb() > > for correctness. ARM's mmiowb() implementation is empty. > > > > So, there is no one size fits all solution with the current state of affairs. > > > > > > I think I finally got what you mean. > > Code seems to have > > wmb() > writel()/writeq() > wmb() > > this can be safely replaced with > > wmb() > __raw_writel()/__raw_writeq() > wmb() > > This will work on all arches. Below is the new version. Let me know if this is > OK. > > +++ b/drivers/infiniband/hw/cxgb4/t4.h > @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 > *src) > int count = 8; > > while (count) { > - writeq(*src, dst); > + __raw_writeq(*src, dst); > src++; > dst++; > count--; > @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, > u16 inc, union t4_wr *wqe) > (u64 *)wqe); > } else { > pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx); > - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), > - wq->sq.bar2_va + SGE_UDB_KDOORBELL); > + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), > + wq->sq.bar2_va + SGE_UDB_KDOORBELL); > } > > /* Flush user doorbell area writes. */ > wmb(); > return; > } > - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); > + __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); > + mmiowmb() > } > > static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, > @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, > u16 inc, > (void *)wqe); > } else { > pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx); > - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), > - wq->rq.bar2_va + SGE_UDB_KDOORBELL); > + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), > + wq->rq.bar2_va + SGE_UDB_KDOORBELL); > } > > /* Flush user doorbell area writes. */ > wmb(); > return; > } > - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); > + __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); > + mmiowmb(); > } > > Yes, this is what chelsio recommended to me. Reviewed-by: Steve Wise <swise@opengridcomputing.com> -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Sinan Kaya <okaya@codeaurora.org> Date: Sat, 17 Mar 2018 00:25:14 -0400 > I think I finally got what you mean. > > Code seems to have > > wmb() > writel()/writeq() > wmb() > > this can be safely replaced with > > wmb() > __raw_writel()/__raw_writeq() > wmb() > > This will work on all arches. Below is the new version. Let me know if this is OK. Unfortunately, I think this won't work. At least on sparc, the __raw_*() variants also change the endianness to native endianness. PowerPC does this as well, even documented in a comment :-) /* * Non ordered and non-swapping "raw" accessors */ Only the non-__raw_ variants do the endianness swap to/from little-endian. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote: > On 3/17/2018 12:03 AM, Sinan Kaya wrote: > > On 3/16/2018 11:40 PM, Sinan Kaya wrote: > >> I'll change writel_relaxed() with __raw_writel() in the series like you suggested > >> and also look at your other comments. > > > > I spoke too soon. > > > > Now that I realized, code needs to follow one of the following patterns for correctness > > > > 1) > > wmb() > > writel()/writel_relaxed() > > > > or > > > > 2) > > wmb() > > __raw_wrltel() > > mmiowb() > > > > but definitely not > > > > wmb() > > __raw_wrltel() > > > > Since #1 == #2, I'll stick to my current implementation of writel_relaxed() > > > > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb() > > for correctness. ARM's mmiowb() implementation is empty. > > > > So, there is no one size fits all solution with the current state of affairs. > > > > > > I think I finally got what you mean. > > Code seems to have > > wmb() > writel()/writeq() > wmb() > > this can be safely replaced with > > wmb() > __raw_writel()/__raw_writeq() > wmb() > > This will work on all arches. Below is the new version. Let me know if this is OK. > > +++ b/drivers/infiniband/hw/cxgb4/t4.h > @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src) > int count = 8; > > while (count) { > - writeq(*src, dst); > + __raw_writeq(*src, dst); > src++; > dst++; > count--; > @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe) > (u64 *)wqe); > } else { > pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx); > - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), > - wq->sq.bar2_va + SGE_UDB_KDOORBELL); > + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), > + wq->sq.bar2_va + SGE_UDB_KDOORBELL); > } > > /* Flush user doorbell area writes. */ > wmb(); > return; > } > - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); > + __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); > + mmiowmb() > } > > static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, > @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, > (void *)wqe); > } else { > pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx); > - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), > - wq->rq.bar2_va + SGE_UDB_KDOORBELL); > + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), > + wq->rq.bar2_va + SGE_UDB_KDOORBELL); > } > > /* Flush user doorbell area writes. */ > wmb(); > return; > } > - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); > + __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); > + mmiowmb(); No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy! Your first patch was right, replacing wmb() writel() With wmb(); writel_relaxed() is fine, and helps ARM. The problem with PPC has nothing to do with the writel, it is with the spinlock, and we can't improve it without adding some new infrastructure. Certainly don't hack around the problem by using __raw_writel in multi-arch drivers! In userspace we settled on something like this as a pattern: mmio_wc_spin_lock() writel_wc_mmio() mmio_wc_spin_unlock() Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to fully contain all MMIO access to WC and UC memory. Due to our userspace the pattern is specifically used with MMIO writes to WC BAR memory, but it could be generalized.. This allows all known arches to use the best code in all cases. x86 can even optimize a lot and combine the mmiowmb() into the spin_unlock, apparently. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h index 8369c7c..7a48c9e 100644 --- a/drivers/infiniband/hw/cxgb4/t4.h +++ b/drivers/infiniband/hw/cxgb4/t4.h @@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe) (u64 *)wqe); } else { pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx); - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), - wq->sq.bar2_va + SGE_UDB_KDOORBELL); + writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid), + wq->sq.bar2_va + SGE_UDB_KDOORBELL); } /* Flush user doorbell area writes. */ wmb(); return; } - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); + writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db); } static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, @@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc, (void *)wqe); } else { pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx); - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), - wq->rq.bar2_va + SGE_UDB_KDOORBELL); + writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid), + wq->rq.bar2_va + SGE_UDB_KDOORBELL); } /* Flush user doorbell area writes. */ wmb(); return; } - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); + writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db); } static inline int t4_wq_in_error(struct t4_wq *wq)
Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Since code already has an explicit barrier call, changing writel() to writel_relaxed(). Signed-off-by: Sinan Kaya <okaya@codeaurora.org> --- drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)