[v2] nvme-rdma: support devices with queue size < 32

Message ID 1519881025.363156294.1491837154312.JavaMail.zimbra@kalray.eu (mailing list archive)
State Superseded

Commit Message

Marta Rybczynska April 10, 2017, 3:12 p.m. UTC
In the case of a small NVMe-oF queue size (<32) we may enter
a deadlock: the IB completions are not signaled until 32 sends have
been posted, so the send queue fills up.

The error is seen as (using mlx5):
[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12

This patch changes the way the signaling is done so
that it now depends on the queue depth. The magic constant has
been removed completely.

Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Signed-off-by: Samuel Jones <sjones@kalray.eu>
---
Changes from v1:
* signal by queue size/2, remove hardcoded 32
* support queue depth of 1

 drivers/nvme/host/rdma.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig April 10, 2017, 3:16 p.m. UTC | #1
> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> +{
> +       int sig_limit;
> +
> +       /* We signal completion every queue depth/2 and also
> +        * handle the case of possible device with queue_depth=1,
> +        * where we would need to signal every message.
> +        */
> +       sig_limit = max(queue->queue_size / 2, 1);
> +       return (++queue->sig_count % sig_limit) == 0;
> +}

I would love if we could do this at the ib_qp level.  I also know
you found a real bug and we want it fixed ASAP, so maybe for now
we should merge this one:

Reviewed-by: Christoph Hellwig <hch@lst.de>

and then do the generic version.  Marta, do you want to tackle the
generic version or should I look into this once I get a little time.
Marta Rybczynska April 10, 2017, 3:21 p.m. UTC | #2
>> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +{
>> +       int sig_limit;
>> +
>> +       /* We signal completion every queue depth/2 and also
>> +        * handle the case of possible device with queue_depth=1,
>> +        * where we would need to signal every message.
>> +        */
>> +       sig_limit = max(queue->queue_size / 2, 1);
>> +       return (++queue->sig_count % sig_limit) == 0;
>> +}
> 
> I would love if we could do this at the ib_qp level.  I also know
> you found a real bug and we want it fixed ASAP, so maybe for now
> we should merge this one:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> and then do the generic version.  Marta, do you want to tackle the
> generic version or should I look into this once I get a little time.

I can look into whether I can do the same thing at the ib_qp level. It would
be great to have other people's feedback on this one before it is merged.
We're testing this one now; shall we see what I can do in a week or
so? Fine with you?

Marta
Bart Van Assche April 10, 2017, 3:27 p.m. UTC | #3
On Mon, 2017-04-10 at 17:16 +0200, Christoph Hellwig wrote:
> > +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> > +{
> > +       int sig_limit;
> > +
> > +       /* We signal completion every queue depth/2 and also
> > +        * handle the case of possible device with queue_depth=1,
> > +        * where we would need to signal every message.
> > +        */
> > +       sig_limit = max(queue->queue_size / 2, 1);
> > +       return (++queue->sig_count % sig_limit) == 0;
> > +}
> 
> I would love if we could do this at the ib_qp level.

Hello Christoph,

Please keep in mind that some but not all RDMA drivers need a construct like
this.

Bart.
Christoph Hellwig April 10, 2017, 3:31 p.m. UTC | #4
On Mon, Apr 10, 2017 at 03:27:14PM +0000, Bart Van Assche wrote:
> > I would love if we could do this at the ib_qp level.
> 
> Hello Christoph,
> 
> Please keep in mind that some but not all RDMA drivers need a construct like
> this.

Sure, I don't want to force anyone to use it.  But I'd like to have
a generic helper that everyone can use instead of reinventing the wheel.
E.g. at least iser will need the same fix, and many drivers would
benefit from not always signalling completions even if they don't have
the ability yet.
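
A rough sketch of the shape such a generic helper could take — the name
rdma_should_signal, its arguments, and its placement are invented here for
illustration and are not an existing ib_verbs API:

/* Hypothetical rdma-core helper: each ULP keeps an atomic counter and a
 * power-of-two signaling interval, so a mask replaces the division.
 */
static inline bool rdma_should_signal(atomic_t *sig_count, int interval)
{
        return (atomic_inc_return(sig_count) & (interval - 1)) == 0;
}

nvme-rdma and iser could then both call this instead of open-coding the
counter logic.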
Bart Van Assche April 10, 2017, 3:32 p.m. UTC | #5
On Mon, 2017-04-10 at 17:12 +0200, Marta Rybczynska wrote:
> [...]
> +static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> +{
> +       int sig_limit;
> +
> +       /* We signal completion every queue depth/2 and also
> +        * handle the case of possible device with queue_depth=1,
> +        * where we would need to signal every message.
> +        */
> +       sig_limit = max(queue->queue_size / 2, 1);
> +       return (++queue->sig_count % sig_limit) == 0;
> +}

Hello Marta,

The approach of this patch is suboptimal from a performance point of view.
If the number of WRs submitted since the last signaled WR were tracked in
a member variable, that would allow getting rid of the (relatively slow)
division operation.

Bart.
Marta Rybczynska April 11, 2017, 8:52 a.m. UTC | #6
> Hello Marta,
> 
> The approach of this patch is suboptimal from a performance point of view.
> If the number of WRs that have been submitted since the last signaled WR
> was submitted would be tracked in a member variable that would allow to
> get rid of the (relatively slow) division operation.
> 

Hello Bart,
I think that we can remove the division (the modulo by sig_limit). The sig_count
is a u8, so it is really the kind of variable you propose. It doesn't seem to be
used anywhere else, so we can change the way it is used in the snippet to count
up to the signaling moment. It would give something like:

static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
{
        int sig_limit;

        /* We signal completion every queue depth/2 and also
         * handle the case of possible device with queue_depth=1,
         * where we would need to signal every message.
         */
        sig_limit = max(queue->queue_size / 2, 1);
        queue->sig_count++;
        if (queue->sig_count < sig_limit)
                return 0;
        queue->sig_count = 0;
        return 1;
}


Do you like it better?

Marta
Max Gurtovoy April 11, 2017, 10:50 a.m. UTC | #7
On 4/11/2017 11:52 AM, Marta Rybczynska wrote:
> static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> {
>        int sig_limit;
>
>        /* We signal completion every queue depth/2 and also
>         * handle the case of possible device with queue_depth=1,
>         * where we would need to signal every message.
>         */
>        sig_limit = max(queue->queue_size / 2, 1);
>        queue->sig_count++;
>        if (queue->sig_count < sig_limit)
>                return 0;
>        queue->sig_count = 0;
>        return 1;
> }
>
>
> Do you like it better?

Hi Marta,
I think that Bart (and I agree with him) meant avoiding the division in the
fast path. You can add a new variable to nvme_rdma_ctrl called sig_limit and
set it once during the initialization stage. Then just do:

if ((++queue->sig_count % queue->ctrl->sig_limit) == 0 || flush)

in the nvme_rdma_post_send function, as sketched below.

Also, as you mentioned, sig_count is a u8, so you need to avoid setting
sig_limit to a value bigger than 255. So please take the minimum of 32 (or
any other suggestion?) and the result of max(queue->queue_size / 2, 1).
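
A minimal sketch of that variant, assuming a sig_limit field is added to the
controller structure as suggested (the field and its initialization point are
hypothetical, not code posted in this thread):

/* once, at controller initialization (hypothetical placement); the cap of
 * 32 keeps the limit well below the range of the u8 sig_count */
ctrl->sig_limit = min(max(queue_size / 2, 1), 32);

/* fast path in nvme_rdma_post_send(): modulo by a precomputed value,
 * nothing is recomputed per send */
if ((++queue->sig_count % queue->ctrl->sig_limit) == 0 || flush)
        wr.send_flags |= IB_SEND_SIGNALED;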

thanks,
Max.

Marta Rybczynska April 11, 2017, 11:04 a.m. UTC | #8
> Hi Marta,
> I think that Bart (and I agree with him) meant avoiding the division in the
> fast path. You can add a new variable to nvme_rdma_ctrl called sig_limit and
> set it once during the initialization stage. Then just do:
>
> if ((++queue->sig_count % queue->ctrl->sig_limit) == 0 || flush)
>
> in the nvme_rdma_post_send function.
>
> Also, as you mentioned, sig_count is a u8, so you need to avoid setting
> sig_limit to a value bigger than 255. So please take the minimum of 32 (or
> any other suggestion?) and the result of max(queue->queue_size / 2, 1).
> 

Hello Max,
Do you mean the division by 2 or the modulo? I would say that the division
by two should be compiled as a shift operation on most architectures, shouldn't it?

Marta
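
For what it is worth, the two operations do differ — a sketch of the
distinction (variable names purely illustrative):

int half = queue->queue_size / 2; /* divisor is the constant 2: compilers
                                   * strength-reduce this to a shift, plus
                                   * a sign fixup for signed ints */
int rem  = count % sig_limit;     /* divisor is a runtime value: this takes
                                   * a real divide instruction unless
                                   * sig_limit is a known power of two */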
Bart Van Assche April 11, 2017, 3:10 p.m. UTC | #9
On Tue, 2017-04-11 at 10:52 +0200, Marta Rybczynska wrote:
> I think that we can remove the division (the modulo sig_limit). The sig_count
> is an u8 so it is really a type of variable you propose. It isn't used anywhere
> it seems so we can change the way it is used in the snippet to count until the
> signaling moment. It will give something like:
> 
> static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> {
>        int sig_limit;
> 
>        /* We signal completion every queue depth/2 and also
>         * handle the case of possible device with queue_depth=1,
>         * where we would need to signal every message.
>         */
>        sig_limit = max(queue->queue_size / 2, 1);
>        queue->sig_count++;
>        if (queue->sig_count < sig_limit)
>                return 0;
>        queue->sig_count = 0;
>        return 1;
> }

Hello Marta,

Thank you for having addressed my feedback. Although this is not important, I
think it is possible to optimize the above code further as follows:
* Instead of initializing / resetting sig_count to zero, initialize it to sig_limit.
* Instead of incrementing sig_count in nvme_rdma_queue_sig_limit(), decrement it.
* Instead of comparing sig_count against sig_limit, compare it against zero.
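
A sketch of the decrement-based variant described above, assuming sig_count
is initialized to the signaling interval when the queue is set up
(hypothetical, not code posted in this thread):

static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
{
        /* counting down means no comparison against a separate limit */
        if (--queue->sig_count > 0)
                return 0;
        /* interval exhausted: reload the counter and signal this send */
        queue->sig_count = max(queue->queue_size / 2, 1);
        return 1;
}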

Bart.
Christoph Hellwig May 22, 2017, 6:51 p.m. UTC | #10
I've hand applied this one with a few formatting tweaks now that
I got the ACK from Sagi.  I'd love to see if someone could figure
out how we could move the improved atomic + ilog version into the
rdma core somehow.
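
For reference, a rough sketch of what such an atomic + ilog scheme can look
like — sig_count becomes an atomic_t and the interval is rounded down to a
power of two so that a mask replaces the modulo (a sketch under those
assumptions, not necessarily the exact code that was merged):

static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
{
        /* largest power of two not above queue_size / 2, and at least 1 */
        int limit = 1 << ilog2(max(queue->queue_size / 2, 1));

        /* atomic_inc_return() removes the race of the unlocked ++, and
         * the mask avoids a division in the fast path */
        return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
}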
Marta Rybczynska May 23, 2017, 3:32 p.m. UTC | #11
> I've hand applied this one with a few formatting tweaks now that
> I got the ACK from Sagi.  I'd love to see if someone could figure
> out how we could move the improved atomic + ilog version into the
> rdma core somehow.

I'll submit the atomic+ilog version in nvme next week. I have no
access to the test systems this week.

Marta
Marta Rybczynska June 5, 2017, 9:47 a.m. UTC | #12
>> I've hand applied this one with a few formatting tweaks now that
>> I got the ACK from Sagi.  I'd love to see if someone could figure
>> out how we could move the improved atomic + ilog version into the
>> rdma core somehow.
> 
> I'll submit the atomic+ilog version in nvme next week. I have no
> access to the test systems this week.
> 

I've just posted the new version, with a changed title, as v2 has been
merged. The new thread name is:
[PATCH] nvme-rdma: remove race conditions from IB signalling

Marta
Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 47a479f..4de1b92 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1029,6 +1029,18 @@  static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
                nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
+static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+{
+       int sig_limit;
+
+       /* We signal completion every queue depth/2 and also
+        * handle the case of possible device with queue_depth=1,
+        * where we would need to signal every message.
+        */
+       sig_limit = max(queue->queue_size / 2, 1);
+       return (++queue->sig_count % sig_limit) == 0;
+}
+
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
                struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
                struct ib_send_wr *first, bool flush)
@@ -1056,9 +1068,6 @@  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
         * Would have been way to obvious to handle this in hardware or
         * at least the RDMA stack..
         *
-        * This messy and racy code sniplet is copy and pasted from the iSER
-        * initiator, and the magic '32' comes from there as well.
-        *
         * Always signal the flushes. The magic request used for the flush
         * sequencer is not allocated in our driver's tagset and it's
         * triggered to be freed by blk_cleanup_queue(). So we need to
@@ -1066,7 +1075,7 @@  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
         * embedded in request's payload, is not freed when __ib_process_cq()
         * calls wr_cqe->done().
         */
-       if ((++queue->sig_count % 32) == 0 || flush)
+       if (nvme_rdma_queue_sig_limit(queue) || flush)
                wr.send_flags |= IB_SEND_SIGNALED;
 
        if (first)