[rdma-core] irdma: Restore full memory barrier for doorbell optimization

Message ID: 20210805161134.1234-1-tatyana.e.nikolova@intel.com (mailing list archive)
State: Changes Requested
Delegated to: Jason Gunthorpe
Series: [rdma-core] irdma: Restore full memory barrier for doorbell optimization

Commit Message

Nikolova, Tatyana E Aug. 5, 2021, 4:11 p.m. UTC
During the irdma library upstream submission we agreed to
replace atomic_thread_fence(memory_order_seq_cst) in the irdma
doorbell optimization algorithm with udma_to_device_barrier().
However, further regression testing uncovered cases where, in
the absence of a full memory barrier, the algorithm incorrectly
skips ringing the doorbell.

There has been a discussion about the necessity of a full
memory barrier for the doorbell optimization in the past:
https://lore.kernel.org/linux-rdma/20170301172920.GA11340@ssaleem-MOBL4.amr.corp.intel.com/

The algorithm decides whether to ring the doorbell based on input
from the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
then the algorithm doesn't ring the doorbell, because it assumes that
the HW is still actively processing WQEs.

The shadow area indicates the last WQE processed by the HW and it is
okay if the shadow area value isn't the most current. However, there
can't be a window of time between reading the shadow area and setting
the valid bit for the last WQE posted, because the HW may go idle and
the algorithm won't detect this.
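
As an illustration, the required ordering can be sketched with C11
atomics as follows. This is a simplified model, not the uk.c code:
wqe_valid, shadow_area, hw_tail_is_behind() and ring_doorbell() are
placeholder names.

#include <stdatomic.h>
#include <stdint.h>

/* Placeholders standing in for the real uk.c state and helpers. */
static _Atomic uint64_t wqe_valid;   /* valid bit of the last posted WQE */
static _Atomic uint64_t shadow_area; /* hw_tail written back by the HW   */
extern int hw_tail_is_behind(uint64_t hw_tail, uint32_t sq_head);
extern void ring_doorbell(void);

static void post_wr_sketch(uint32_t sq_head)
{
	uint64_t hw_tail;

	/* 1. publish the WQE by setting its valid bit */
	atomic_store_explicit(&wqe_valid, 1, memory_order_relaxed);

	/* 2. full barrier: the valid-bit store must be globally visible
	 *    before the shadow area is read, or the HW can go idle unseen */
	atomic_thread_fence(memory_order_seq_cst);

	/* 3. read the shadow area and decide whether to ring */
	hw_tail = atomic_load_explicit(&shadow_area, memory_order_relaxed);
	if (!hw_tail_is_behind(hw_tail, sq_head))
		ring_doorbell();
	/* else: the HW is assumed to still be processing WQEs */
}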

The following two cases illustrate this issue; they are identical
except for the ordering of operations. The first case is an example
of how the wrong order results in not ringing the doorbell when the
HW has gone idle.

Case 1. Failing case without a full memory barrier

Write WQE#3

Read shadow (hw_tail)

hw_tail = WQE#1 (i.e. WQE#1 has been processed), so the algorithm
doesn't ring the doorbell. However, in the window of time between
reading the shadow area and setting the valid bit, the HW goes idle
after processing WQE#2 (the valid bit for WQE#3 has not been set yet,
so the HW does not see the new work).

Set valid bit for WQE#3

----------------------------------------------------------------------

Case 2. Passing case with a full memory barrier

Write WQE#3

Set valid bit for WQE#3

Read shadow (hw_tail)

hw_tail = WQE#1 (i.e. WQE#1 has been processed), so the algorithm
doesn't ring the doorbell. The HW is active and is expected to see
and process WQE#3 before going idle.

----------------------------------------------------------------------

This patch restores the full memory barrier required for the doorbell
optimization.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
---
 providers/irdma/uk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jason Gunthorpe Aug. 6, 2021, 1:28 a.m. UTC | #1
On Thu, Aug 05, 2021 at 11:11:34AM -0500, Tatyana Nikolova wrote:
> During the irdma library upstream submission we agreed to
> replace atomic_thread_fence(memory_order_seq_cst) in the irdma
> doorbell optimization algorithm with udma_to_device_barrier().
> However, further regression testing uncovered cases where in
> absence of a full memory barrier, the algorithm incorrectly
> skips ringing the doorbell.
> 
> There has been a discussion about the necessity of a full
> memory barrier for the doorbell optimization in the past:
> https://lore.kernel.org/linux-rdma/20170301172920.GA11340@ssaleem-MOBL4.amr.corp.intel.com/
> 
> The algorithm decides whether to ring the doorbell based on input
> from the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
> then the algorithm doesn't ring the doorbell, because it assumes that
> the HW is still actively processing WQEs.
> 
> The shadow area indicates the last WQE processed by the HW and it is
> okay if the shadow area value isn't the most current. However there
> can't be a window of time between reading the shadow area and setting
> the valid bit for the last WQE posted, because the HW may go idle and
> the algorithm won't detect this.
> 
> The following two cases illustrate this issue and are identical,
> except for ops ordering. The first case is an example of how
> the wrong order results in not ringing the doorbell when the
> HW has gone idle.

I can't really understand this explanation. Since this seems to be
about a concurrency problem, can you please explain it using a normal
ladder diagram? (eg 1a3402d93c73 ("posix-cpu-timers: Fix rearm racing
against process tick") to pick an example at random)

> diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c
> index c7053c52..d63996db 100644
> +++ b/providers/irdma/uk.c
> @@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk *qp)
>  	__u32 sw_sq_head;
>  
>  	/* valid bit is written and loads completed before reading shadow */
> -	udma_to_device_barrier();
> +	atomic_thread_fence(memory_order_seq_cst);

Because it certainly looks wrong to replace a DMA barrier with
something that is not a DMA barrier.

I'm guessing this problem is that the shadow memory is not locked and
also not using atomics to control concurrent access to it? If so
then the fix is to use atomics for the shadow memory and place the
proper ordering requirement on the atomic itself.

Jason
Nikolova, Tatyana E Aug. 9, 2021, 8:07 p.m. UTC | #2
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, August 5, 2021 8:29 PM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
> 
> On Thu, Aug 05, 2021 at 11:11:34AM -0500, Tatyana Nikolova wrote:
> > During the irdma library upstream submission we agreed to replace
> > atomic_thread_fence(memory_order_seq_cst) in the irdma doorbell
> > optimization algorithm with udma_to_device_barrier().
> > However, further regression testing uncovered cases where in absence
> > of a full memory barrier, the algorithm incorrectly skips ringing the
> > doorbell.
> >
> > There has been a discussion about the necessity of a full memory
> > barrier for the doorbell optimization in the past:
> > https://lore.kernel.org/linux-rdma/20170301172920.GA11340@ssaleem-
> MOBL
> > 4.amr.corp.intel.com/
> >
> > The algorithm decides whether to ring the doorbell based on input from
> > the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
> > then the algorithm doesn't ring the doorbell, because it assumes that
> > the HW is still actively processing WQEs.
> >
> > The shadow area indicates the last WQE processed by the HW and it is
> > okay if the shadow area value isn't the most current. However there
> > can't be a window of time between reading the shadow area and setting
> > the valid bit for the last WQE posted, because the HW may go idle and
> > the algorithm won't detect this.
> >
> > The following two cases illustrate this issue and are identical,
> > except for ops ordering. The first case is an example of how the wrong
> > order results in not ringing the doorbell when the HW has gone idle.
> 
> I can't really understand this explanation. Since this seems to be about a
> concurrency problem, can you please explain it using a normal ladder
> diagram? (eg 1a3402d93c73 ("posix-cpu-timers: Fix rearm racing against
> process tick") to pick an example at random)
>
Hi Jason,

The doorbell optimization algorithm involves the following operations:

1.	Software writing the valid bit in the WQE.
2.	Software reading shadow memory (hw_tail) value.
3.	Software deciding whether or not to ring the doorbell.
4.	Hardware reading that WQE.

Without the MFENCE after step 1, the processor is free to reorder these operations - it can make the write of the valid bit visible only after software has already read the hw_tail value.

MFENCE ensures the order of 1 and 2. 

MFENCE (Vol. 1 11-12)
"The MFENCE instruction establishes a memory fence for both loads and stores. The processor ensures that no load or store after MFENCE will become globally visible until all loads and stores before MFENCE are globally visible"

If the order of 1 and 2 is not maintained, the algorithm can fail because the hardware may not see the valid bit and the doorbell optimization algorithm does not ring the doorbell. This causes hardware to go idle even though there is work to do.

The failure scenario without the MFENCE is shown in a ladder diagram:

SOFTWARE                    CPU                           DMA DEVICE

Writes the valid bit
SFENCE
Reads the hw_tail value
                            <Reorders instructions>
                            Reads hw_tail value

Decides not to ring the doorbell.

                                                          Reads WQE - <valid bit is not set and hardware goes idle>
                            Writes valid bit
                            SFENCE

The order of 1 and 2 is not maintained which causes hardware to go idle even though there is work to do.

Our doorbell optimization algorithm has always required a full memory barrier because it prevents reordering of stores and loads.
There used to be an MFENCE in the i40iw doorbell algorithm. In a previous discussion, https://lore.kernel.org/linux-rdma/20170306194052.GB31672@obsidianresearch.com/, you suggested the use of atomic_thread_fence(memory_order_seq_cst), which is a C11 barrier, and this did work for us.
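
Viewed purely in C11 terms, the race is the classic store-buffering pattern. Below is a minimal sketch with the DMA engine hypothetically modelled as a second thread (it is of course not a C11 thread, and its side is really governed by DMA ordering rules; all names are placeholders):

#include <stdatomic.h>

static _Atomic int valid;    /* valid bit of the newly posted WQE     */
static _Atomic int hw_tail;  /* shadow area: last WQE processed by HW */

static int sw_posts_wqe(void)        /* software: steps 1 and 2 */
{
	atomic_store_explicit(&valid, 1, memory_order_release);
	/* without a full fence here, this load can be satisfied before
	 * the store above is globally visible */
	return atomic_load_explicit(&hw_tail, memory_order_acquire);
}

static int hw_checks_for_work(void)  /* device model: about to go idle */
{
	atomic_store_explicit(&hw_tail, 2, memory_order_release); /* done with WQE#2 */
	return atomic_load_explicit(&valid, memory_order_acquire);
}

/* With release/acquire only, the outcome "sw_posts_wqe() returns the old
 * tail AND hw_checks_for_work() returns 0" is permitted - exactly the
 * missed doorbell. The MFENCE between the software store and load is
 * what this patch relies on to rule that outcome out. */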

Thank you,
Tatyana

 
> > diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c index
> > c7053c52..d63996db 100644
> > +++ b/providers/irdma/uk.c
> > @@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk
> *qp)
> >  	__u32 sw_sq_head;
> >
> >  	/* valid bit is written and loads completed before reading shadow */
> > -	udma_to_device_barrier();
> > +	atomic_thread_fence(memory_order_seq_cst);
> 
> Because it certainly looks wrong to replace a DMA barrier with something
> that is not a DMA barrier.
> 
> I'm guessing this problem is that the shadow memory is not locked and also
> not using atomics to control concurrent access to it? If so then the fix is to
> use atomics for the shadow memory and place the proper ordering
> requirement on the atomic itself.
> 
> Jason
Jason Gunthorpe Aug. 10, 2021, 11:59 a.m. UTC | #3
On Mon, Aug 09, 2021 at 08:07:32PM +0000, Nikolova, Tatyana E wrote:
> 1.	Software writing the valid bit in the WQE.
> 2.	Software reading shadow memory (hw_tail) value.

You are missing an ordered atomic on this read it looks like

Jason
Nikolova, Tatyana E Aug. 13, 2021, 10:25 p.m. UTC | #4
>> 1.	Software writing the valid bit in the WQE.
>> 2.	Software reading shadow memory (hw_tail) value.

> You are missing an ordered atomic on this read it looks like

Hi Jason,

Why do you think we need atomic ops in this case? We aren't trying to protect against multiple threads, but against CPU re-ordering of a write and a read.

Thank you,
Tatyana
Jason Gunthorpe Aug. 18, 2021, 4:49 p.m. UTC | #5
On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> >> 1.	Software writing the valid bit in the WQE.
> >> 2.	Software reading shadow memory (hw_tail) value.
> 
> > You are missing an ordered atomic on this read it looks like
> 
> Hi Jason,
> 
> Why do you think we need atomic ops in this case? We aren't trying
> to protect from multiple threads but CPU re-ordering of a write and
> a read.

Which is what the atomics will do.

Barriers are only appropriate when you can't add atomic markers to the
actual data that needs ordering.

Jason
Nikolova, Tatyana E Aug. 19, 2021, 10:01 p.m. UTC | #6
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 18, 2021 11:50 AM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
> 
> On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > >> 1.	Software writing the valid bit in the WQE.
> > >> 2.	Software reading shadow memory (hw_tail) value.
> >
> > > You are missing an ordered atomic on this read it looks like
> >
> > Hi Jason,
> >
> > Why do you think we need atomic ops in this case? We aren't trying to
> > protect from multiple threads but CPU re-ordering of a write and a
> > read.
> 
> Which is what the atomics will do.
> 
> Barriers are only appropriate when you can't add atomic markers to the
> actual data that needs ordering.

Hi Jason,

We aren't sure what you mean by atomic markers. We ran a few experiments with atomics, but none of the barriers we tried (smp_mb__{before,after}_atomic(), smp_load_acquire() and smp_store_release()) translates to a full memory barrier on x86.

Could you give us an example?

Thank you,
Tatyana
Jason Gunthorpe Aug. 19, 2021, 10:44 p.m. UTC | #7
On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote:
> 
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 18, 2021 11:50 AM
> > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> > doorbell optimization
> > 
> > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > > >> 1.	Software writing the valid bit in the WQE.
> > > >> 2.	Software reading shadow memory (hw_tail) value.
> > >
> > > > You are missing an ordered atomic on this read it looks like
> > >
> > > Hi Jason,
> > >
> > > Why do you think we need atomic ops in this case? We aren't trying to
> > > protect from multiple threads but CPU re-ordering of a write and a
> > > read.
> > 
> > Which is what the atomics will do.
> > 
> > Barriers are only appropriate when you can't add atomic markers to the
> > actual data that needs ordering.
> 
> Hi Jason,
> 
> We aren't sure what you mean by atomic markers. We ran a few
> experiments with atomics, but none of the barriers we tried
> smp_mb__{before,after}_atomic(), smp_load_acquire() and
> smp_store_release() translates to a full memory barrier on X86.

Huh? Those are kernel primitives, this is a userspace patch.

Userspace follows the C11 atomics memory model.

So I'd expect 

  atomic_store_explicit(tail, memory_order_release)
  atomic_load_explicit(tail, memory_order_acquire)

To be the atomics you need. This will ensure that the read/writes to
valid before the atomics are sequenced correctly, eg no CPU thread can
observe an updated tail without also observing the set valid.
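
The guarantee described above is the standard release/acquire
message-passing pattern; a minimal, self-contained sketch with
hypothetical names (separate from the irdma code):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t valid; /* data published before the tail update  */
static _Atomic uint64_t tail;  /* the location that carries the ordering */

static void producer(void)
{
	atomic_store_explicit(&valid, 1, memory_order_relaxed);
	atomic_store_explicit(&tail, 1, memory_order_release);   /* publish */
}

static uint64_t consumer(void)
{
	if (atomic_load_explicit(&tail, memory_order_acquire))   /* observe */
		/* seeing the new tail guarantees seeing valid == 1 */
		return atomic_load_explicit(&valid, memory_order_relaxed);
	return 0;
}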

Jason
Nikolova, Tatyana E Sept. 2, 2021, 4:27 p.m. UTC | #8
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, August 19, 2021 5:44 PM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
> 
> On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote:
> >
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, August 18, 2021 11:50 AM
> > > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier
> > > for doorbell optimization
> > >
> > > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > > > >> 1.	Software writing the valid bit in the WQE.
> > > > >> 2.	Software reading shadow memory (hw_tail) value.
> > > >
> > > > > You are missing an ordered atomic on this read it looks like
> > > >
> > > > Hi Jason,
> > > >
> > > > Why do you think we need atomic ops in this case? We aren't trying
> > > > to protect from multiple threads but CPU re-ordering of a write
> > > > and a read.
> > >
> > > Which is what the atomics will do.
> > >
> > > Barriers are only appropriate when you can't add atomic markers to
> > > the actual data that needs ordering.
> >
> > Hi Jason,
> >
> > We aren't sure what you mean by atomic markers. We ran a few
> > experiments with atomics, but none of the barriers we tried
> > smp_mb__{before,after}_atomic(), smp_load_acquire() and
> > smp_store_release() translates to a full memory barrier on X86.
> 
> Huh? Those are kernel primitives, this is a userspace patch.
> 
> Userspace follows the C11 atomics memory model.
> 
> So I'd expect
> 
>   atomic_store_explicit(tail, memory_order_release)
>   atomic_load_explicit(tail, memory_order_acquire)
> 
> To be the atomics you need. This will ensure that the read/writes to valid
> before the atomics are sequenced correctly, eg no CPU thread can observe
> an updated tail without also observing the set valid.
> 

Hi Jason,

We tried these atomic ops as shown below, but they don't fix the issue:

atomic_store_explicit(hdr, memory_order_release)
atomic_load_explicit(tail, memory_order_acquire)

In assembly they look like this:

//set_64bit_val(wqe, 24, hdr);
atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_release);
                     2130:       49 89 5f 18             mov    %rbx,0x18(%r15)
/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:747


/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:123
        temp = atomic_load_explicit((_Atomic(__u64) *)qp->shadow_area, memory_order_acquire);
    1c32:       15 00 00 28 84          adc    $0x84280000,%eax


However, the following works:
 atomic_store_explicit(hdr, memory_order_seq_cst)

//set_64bit_val(wqe, 24, hdr);
 atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr,  memory_order_seq_cst);
    2130:       49 89 5f 18             mov    %rbx,0x18(%r15)
    2134:       0f ae f0                mfence
/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:748
 

atomic_load_explicit(tail, memory_order_seq_cst) - same assembly as with memory_order_acquire
 
Thank you,
Tatyana
Jason Gunthorpe Sept. 2, 2021, 5:09 p.m. UTC | #9
On Thu, Sep 02, 2021 at 04:27:37PM +0000, Nikolova, Tatyana E wrote:
> 
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, August 19, 2021 5:44 PM
> > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> > doorbell optimization
> > 
> > On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote:
> > >
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, August 18, 2021 11:50 AM
> > > > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > > > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > > > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier
> > > > for doorbell optimization
> > > >
> > > > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > > > > >> 1.	Software writing the valid bit in the WQE.
> > > > > >> 2.	Software reading shadow memory (hw_tail) value.
> > > > >
> > > > > > You are missing an ordered atomic on this read it looks like
> > > > >
> > > > > Hi Jason,
> > > > >
> > > > > Why do you think we need atomic ops in this case? We aren't trying
> > > > > to protect from multiple threads but CPU re-ordering of a write
> > > > > and a read.
> > > >
> > > > Which is what the atomics will do.
> > > >
> > > > Barriers are only appropriate when you can't add atomic markers to
> > > > the actual data that needs ordering.
> > >
> > > Hi Jason,
> > >
> > > We aren't sure what you mean by atomic markers. We ran a few
> > > experiments with atomics, but none of the barriers we tried
> > > smp_mb__{before,after}_atomic(), smp_load_acquire() and
> > > smp_store_release() translates to a full memory barrier on X86.
> > 
> > Huh? Those are kernel primitives, this is a userspace patch.
> > 
> > Userspace follows the C11 atomics memory model.
> > 
> > So I'd expect
> > 
> >   atomic_store_explicit(tail, memory_order_release)
> >   atomic_load_explicit(tail, memory_order_acquire)
> > 
> > To be the atomics you need. This will ensure that the read/writes to valid
> > before the atomics are sequenced correctly, eg no CPU thread can observe
> > an updated tail without also observing the set valid.
> > 
> 
> Hi Jason,
> 
> We tried these atomic ops as shown bellow, but they don't fix the issue.
> 
> atomic_store_explicit(hdr, memory_order_release) atomic_load_explicit(tail, memory_order_acquire)
> 
> In assembly they look like this:
> 
> //set_64bit_val(wqe, 24, hdr);
> atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_release);
>                      2130:       49 89 5f 18             mov    %rbx,0x18(%r15)
> /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:747
> 
> 
> /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:123
>         temp = atomic_load_explicit((_Atomic(__u64) *)qp->shadow_area, memory_order_acquire);
>     1c32:       15 00 00 28 84          adc    $0x84280000,%eax
> 
> 
> However, the following works:
>  atomic_store_explicit(hdr, memory_order_seq_cst)
> 
> //set_64bit_val(wqe, 24, hdr);
>  atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr,  memory_order_seq_cst);
>     2130:       49 89 5f 18             mov    %rbx,0x18(%r15)
>     2134:       0f ae f0                mfence
> /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:748
>  
> 
> atomic_load_explicit(tail, memory_order_seq_cst) - same assembly as with memory_order_acquire

Again, you shouldn't need the mfence unless you are doing something
very special, and so far nothing you've explained gets at what that
special thing is.

According to the x86 memory model, when you do a store release all
writes made by the current CPU will be observed by the CPU that does
the matching load acquire, which matches what you asked for.

The main purpose of the atomic notations on x86 is to force the
compiler to put all loads/stores in the correct order when going to
assembly. The x86 architecture takes care of the rest without MFENCE,
and from your prior description it seems the issue was compiler
re-ordering of loads/stores around a serializing load/store.

MFENCE is about preventing future stores from becoming visible until
prior stores are, which you shouldn't need with the simple
acquire/release semantics that you described.
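
For orientation, and consistent with the assembly quoted earlier in the
thread, here is a sketch of what the store variants discussed here
typically compile to on x86-64 (codegen is compiler-dependent; hdr_slot
is a placeholder):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t hdr_slot;

static void store_variants(uint64_t hdr)
{
	/* relaxed store: plain mov, compiler ordering only */
	atomic_store_explicit(&hdr_slot, hdr, memory_order_relaxed);

	/* release store: still a plain mov on x86-64 (TSO already keeps
	 * store->store and load->load order) - matches the listing above */
	atomic_store_explicit(&hdr_slot, hdr, memory_order_release);

	/* seq_cst store: mov + mfence (or a locked instruction), which in
	 * addition orders the store against later loads */
	atomic_store_explicit(&hdr_slot, hdr, memory_order_seq_cst);
}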

Jason

Patch

diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c
index c7053c52..d63996db 100644
--- a/providers/irdma/uk.c
+++ b/providers/irdma/uk.c
@@ -118,7 +118,7 @@  void irdma_uk_qp_post_wr(struct irdma_qp_uk *qp)
 	__u32 sw_sq_head;
 
 	/* valid bit is written and loads completed before reading shadow */
-	udma_to_device_barrier();
+	atomic_thread_fence(memory_order_seq_cst);
 
 	/* read the doorbell shadow area */
 	get_64bit_val(qp->shadow_area, 0, &temp);