
[RFC] nvme-rdma: support devices with queue size < 32

Message ID 1315914765.312051621.1490259849534.JavaMail.zimbra@kalray.eu (mailing list archive)
State RFC

Commit Message

Marta Rybczynska March 23, 2017, 9:04 a.m. UTC
In the case of a small NVMe-oF queue size (<32) we may enter a
deadlock: IB send completions are signalled only once every 32
requests, so the send queue fills up before a completion is requested.

The error is seen as (using mlx5):
[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12

The patch doesn't change the behaviour for remote devices with
larger queues.

Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Signed-off-by: Samuel Jones <sjones@kalray.eu>
---
 drivers/nvme/host/rdma.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig March 23, 2017, 2 p.m. UTC | #1
On Thu, Mar 23, 2017 at 10:04:09AM +0100, Marta Rybczynska wrote:
> In the case of a small NVMe-oF queue size (<32) we may enter a
> deadlock: IB send completions are signalled only once every 32
> requests, so the send queue fills up before a completion is requested.
> 
> The error is seen as (using mlx5):
> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
> 
> The patch doesn't change the behaviour for remote devices with
> larger queues.

Thanks, this looks useful.  But wouldn't it be better to do something
like queue_size divided by 2 or 4 to get a better refill latency?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marta Rybczynska March 23, 2017, 2:36 p.m. UTC | #2
----- Mail original -----
> On Thu, Mar 23, 2017 at 10:04:09AM +0100, Marta Rybczynska wrote:
>> In the case of a small NVMe-oF queue size (<32) we may enter a
>> deadlock: IB send completions are signalled only once every 32
>> requests, so the send queue fills up before a completion is requested.
>> 
>> The error is seen as (using mlx5):
>> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
>> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
>> 
>> The patch doesn't change the behaviour for remote devices with
>> larger queues.
> 
> Thanks, this looks useful.  But wouldn't it be better to do something
> like queue_size divided by 2 or 4 to get a better refill latency?

That's an interesting question. The max number of requests is already 3 or 4 times
the queue size because of different message types (see Sam's original
message in 'NVMe RDMA driver: CX4 send queue fills up when nvme queue depth is low').
I guess it would have an influence on configs with bigger latency.

I would like to have Sagi's view on this as he's the one who has changed that
part in the iSER initiator in 6df5a128f0fde6315a44e80b30412997147f5efd

Marta
Sagi Grimberg March 28, 2017, 11:09 a.m. UTC | #3
>> Thanks, this looks useful.  But wouldn't it be better to do something
>> like queue_size divided by 2 or 4 to get a better refill latency?
>
> That's an interesting question. The max number of requests is already 3 or 4 times
> the queue size because of different message types (see Sam's original
> message in 'NVMe RDMA driver: CX4 send queue fills up when nvme queue depth is low').
> I guess it would have an influence on configs with bigger latency.
>
> I would like to have Sagi's view on this as he's the one who has changed that
> part in the iSER initiator in 6df5a128f0fde6315a44e80b30412997147f5efd

Hi Marta,

This looks good indeed. IIRC both for IB and iWARP we need to signal
at least once every send-queue depth (in practice I remember that some
devices need more than once) so maybe it'll be good to have division by
2.

Maybe it'll be better if we do:

static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
{
	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
}

And lose the hard-coded 32 entirely. Care to test that?
Marta Rybczynska March 28, 2017, 11:20 a.m. UTC | #4
----- Mail original -----
>>> Thanks, this looks useful.  But wouldn't it be better to do something
>>> like queue_size divided by 2 or 4 to get a better refill latency?
>>
>> That's an interesting question. The max number of requests is already 3 or 4
>> times the queue size because of different message types (see Sam's original
>> message in 'NVMe RDMA driver: CX4 send queue fills up when nvme queue depth is
>> low').
>> I guess it would have an influence on configs with bigger latency.
>>
>> I would like to have Sagi's view on this as he's the one who has changed that
>> part in the iSER initiator in 6df5a128f0fde6315a44e80b30412997147f5efd
> 
> Hi Marta,
> 
> This looks good indeed. IIRC both for IB and iWARP we need to signal
> at least once every send-queue depth (in practice I remember that some
> devices need more than once) so maybe it'll be good to have division by
> 2.
> 
> Maybe it'll be better if we do:
> 
> static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
> {
>	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
> }
> 
> And lose the hard-coded 32 entirely. Care to test that?

Hello Sagi,
I agree with you. We've found a setup where signalling once every queue
depth is not enough, and we're testing the division by two, which seems
to work fine so far.

In your version, in the case of queue length > 32 the notifications would
be sent less often than they are now. I'm wondering if it will have an
impact on performance and internal card buffering (it seems that
Mellanox buffers are ~100 elements). Wouldn't it create issues?

I'd like to see the magic constant removed. From what I can see we
need something that does not exceed the send buffer of the card but
is also not lower than the queue depth. What do you think?

Marta
Sagi Grimberg March 28, 2017, 11:30 a.m. UTC | #5
>> Maybe it'll be better if we do:
>>
>> static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
>> {
>> 	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
>> }
>>
>> And lose the hard-coded 32 entirely. Care to test that?
>
> Hello Sagi,
> I agree with you. We've found a setup where signalling once every queue
> depth is not enough, and we're testing the division by two, which seems
> to work fine so far.
>
> In your version, in the case of queue length > 32 the notifications would
> be sent less often than they are now. I'm wondering if it will have an
> impact on performance and internal card buffering (it seems that
> Mellanox buffers are ~100 elements). Wouldn't it create issues?
>
> I'd like to see the magic constant removed. From what I can see we
> need something that does not exceed the send buffer of the card but
> is also not lower than the queue depth. What do you think?

I'm not sure what buffering is needed from the device at all in this
case; the device is simply expected to avoid signaling completions.

Mellanox folks, any idea where is this limitation coming from?
Do we need a device capability for it?
Marta Rybczynska March 29, 2017, 9:36 a.m. UTC | #6
>>> Maybe it'll be better if we do:
>>>
>>> static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
>>> {
>>> 	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
>>> }
>>>
>>> And lose the hard-coded 32 entirely. Care to test that?
>>
>> Hello Sagi,
>> I agree with you. We've found a setup where signalling once every queue
>> depth is not enough, and we're testing the division by two, which seems
>> to work fine so far.
>>
>> In your version, in the case of queue length > 32 the notifications would
>> be sent less often than they are now. I'm wondering if it will have an
>> impact on performance and internal card buffering (it seems that
>> Mellanox buffers are ~100 elements). Wouldn't it create issues?
>>
>> I'd like to see the magic constant removed. From what I can see we
>> need something that does not exceed the send buffer of the card but
>> is also not lower than the queue depth. What do you think?
> 
> I'm not sure what buffering is needed from the device at all in this
> case, the device is simply expected to avoid signaling completions.
> 
> Mellanox folks, any idea where is this limitation coming from?
> Do we need a device capability for it?

In the case of mlx5 we're getting -ENOMEM from begin_wqe
(the mlx5_wq_overflow condition). This queue is sized in the driver
based on multiple factors. If we ack less often, this could
happen for higher queue depths too, I think.

Marta
Jason Gunthorpe March 29, 2017, 1:29 p.m. UTC | #7
On Tue, Mar 28, 2017 at 02:30:14PM +0300, Sagi Grimberg wrote:
> I'm not sure what buffering is needed from the device at all in this
> case, the device is simply expected to avoid signaling completions.
> 
> Mellanox folks, any idea where is this limitation coming from?
> Do we need a device capability for it?

Fundamentally you must drive SQ flow control via CQ completions. For
instance a ULP cannot disable all CQ notifications and keep
stuffing things into the SQ.

An alternative way to state this: A ULP cannot use activity on the
RQ to infer that there is space in the SQ. Only CQ completions can be
used to prove there is more available SQ space. Do not post to the SQ
until a CQ has been polled proving available space.

Ultimately you need a minimum of one CQ notification for every SQ
depth's worth of posts, and the ULP must not post to the SQ once it
fills until it sees the CQ notification. That usually drives the rule
of thumb to notify every 1/2 depth; however, any SQ WQE posting
failures indicate a ULP bug.

There are a bunch of varied reasons for this, and it was discussed to
death for NFS. NFS's bugs and wonkiness in this area went away when
Chuck did strict SQ capacity accounting driven by the CQ...

Jason
Sagi Grimberg March 29, 2017, 3:47 p.m. UTC | #8
Jason,

> Fundamentally you must drive SQ flow control via CQ completions. For
> instance a ULP cannot disable all CQ notifications and keep
> stuffing things into the SQ.
>
> An alternative way to state this: A ULP cannot use activity on the
> RQ to infer that there is space in the SQ. Only CQ completions can be
> used to prove there is more available SQ space. Do not post to the SQ
> until a CQ has been polled proving available space.

The recv queue is out of the ball game here...

We just selectively signal send completions only to reduce some
interrupts and completion processing.

> Ultimately you need a minimum of one CQ notification for every SQ
> depth post and the ULP must not post to the SQ once it fills until
> it sees the CQ notification. That usually drives the rule of thumb to
> notify every 1/2 depth, however any SQWE posting failures indicate a
> ULP bug..

Well, usually, but I'm not convinced this is the case.

For each I/O we post up to 2 work requests, 1 for memory registration
and 1 for sending an I/O request (and 1 for local invalidation if the
target doesn't do it for us, but that is not the case here). So if our
queue depth is X, we size our completion queue to be X*3, and we need
to make sure we signal every (X*3)/2.

What I think Marta is seeing is that no matter how large we size our
send queue (in other words, no matter how high X is), we're not able
to get away with suppressing send completion signaling for more than
~100 posts.

Please correct me if I'm wrong Marta?
Jason Gunthorpe March 29, 2017, 4:27 p.m. UTC | #9
On Wed, Mar 29, 2017 at 06:47:54PM +0300, Sagi Grimberg wrote:

> For each I/O we post up to 2 work requests, 1 for memory registration
> and 1 for sending an I/O request (and 1 for local invalidation if the
> target doesn't do it for us, but that is not the case here). So if our
> queue depth is X, we size our completion queue to be X*3, and we need
> to make sure we signal every (X*3)/2.

??? If your SQ is X and your CQ is X*3 you need to signal at X/2.

Jason
Sagi Grimberg March 29, 2017, 4:39 p.m. UTC | #10
>> For each I/O we post up to 2 work requests, 1 for memory registration
>> and 1 for sending an I/O request (and 1 for local invalidation if the
>> target doesn't do it for us, but that is not the case here). So if our
>> queue depth is X, we size our completion queue to be X*3, and we need
>> to make sure we signal every (X*3)/2.
>
> ??? If your SQ is X and your CQ is X*3 you need to signal at X/2.

Sorry, I confused SQ with CQ (which made it even more confusing..)

Our application queue-depth is X, we size our SQ to be X*3
(send+reg+inv), we size our RQ to be X (resp) and our CQ to be
X*4 (SQ+RQ).

So we should signal every (X*3)/2
Doug Ledford March 29, 2017, 4:44 p.m. UTC | #11
On 3/29/17 11:39 AM, Sagi Grimberg wrote:
>
>>> For each I/O we post up to 2 work requests, 1 for memory registration
>>> and 1 for sending an I/O request (and 1 for local invalidation if the
>>> target doesn't do it for us, but that is not the case here). So if our
>>> queue depth is X, we size our completion queue to be X*3, and we need
>>> to make sure we signal every (X*3)/2.
>>
>> ??? If your SQ is X and your CQ is X*3 you need to signal at X/2.
>
> Sorry, I confused SQ with CQ (which made it even more confusing..)
>
> Our application queue-depth is X, we size our SQ to be X*3
> (send+reg+inv), we size our RQ to be X (resp) and our CQ to be
> X*4 (SQ+RQ).
>
> So we should signal every (X*3)/2

You say above "we post *up to* 2 work requests". Unless you wish to
change that to "we always post at least 2 work requests per queue
entry", Jason is right: your frequency of signaling needs to be X/2
regardless of your CQ size. You need the signaling to control the
queue depth tracking.

Doug Ledford March 29, 2017, 4:47 p.m. UTC | #12
On 3/29/17 11:44 AM, Doug Ledford wrote:
> On 3/29/17 11:39 AM, Sagi Grimberg wrote:
>>
>>>> For each I/O we post up to 2 work requests, 1 for memory registration
>>>> and 1 for sending an I/O request (and 1 for local invalidation if the
>>>> target doesn't do it for us, but that is not the case here). So if our
>>>> queue depth is X, we size our completion queue to be X*3, and we need
>>>> to make sure we signal every (X*3)/2.
>>>
>>> ??? If your SQ is X and your CQ is X*3 you need to signal at X/2.
>>
>> Sorry, I confused SQ with CQ (which made it even more confusing..)
>>
>> Our application queue-depth is X, we size our SQ to be X*3
>> (send+reg+inv), we size our RQ to be X (resp) and our CQ to be
>> X*4 (SQ+RQ).
>>
>> So we should signal every (X*3)/2
>
> You say above "we post *up to* 2 work requests", unless you wish to
> change that to "we always post at least 2 work requests per queue
> entry", Jason is right, your frequency of signaling needs to be X/2
> regardless of your CQ size, you need the signaling to control the queue
> depth tracking.

If you would like to spread things out farther between signaling, then
you can modify your send routine to only increment the send counter for
actual send requests, ignoring registration WQEs and invalidate WQEs,
and then signal every X/2 sends.
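
As an illustration, Doug's scheme can be modelled in plain userspace C; the struct and function names below are invented for the sketch and are not the driver's actual ones:

```c
#include <stdbool.h>

/*
 * Userspace model of Doug's suggestion: count only actual command
 * sends (not memory-registration or invalidation WRs) and request a
 * signalled completion on every (X/2)-th send, where X is the
 * application queue depth.
 */
struct model_queue {
	int queue_size;	/* X: application queue depth */
	int sig_count;	/* incremented for command sends only */
};

/* Call this only when posting the command send WR itself. */
static bool model_sig_limit(struct model_queue *q)
{
	return (++q->sig_count % (q->queue_size / 2)) == 0;
}
```

With X = 8 this requests a signalled completion on every 4th command send, no matter how many registration or invalidation WRs were posted in between.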

Sagi Grimberg March 29, 2017, 4:59 p.m. UTC | #13
>> You say above "we post *up to* 2 work requests", unless you wish to
>> change that to "we always post at least 2 work requests per queue
>> entry", Jason is right, your frequency of signaling needs to be X/2
>> regardless of your CQ size, you need the signaling to control the queue
>> depth tracking.
>
> If you would like to spread things out farther between signaling, then
> you can modify your send routine to only increment the send counter for
> actual send requests, ignoring registration WQEs and invalidate WQES,
> and then signal every X/2 sends.

Yea, you're right, and not only did I get it wrong, I even contradicted my
own suggestion, which was exactly what you and Jason suggested (where is
the nearest rat-hole...)

So I suggested to signal every X/2 and Marta reported SQ overflows for
high queue-depth. Marta, at what queue-depth have you seen this?
Jason Gunthorpe March 29, 2017, 10:19 p.m. UTC | #14
On Wed, Mar 29, 2017 at 07:59:13PM +0300, Sagi Grimberg wrote:
> Yea, you're right, and not only did I get it wrong, I even contradicted my
> own suggestion, which was exactly what you and Jason suggested (where is
> the nearest rat-hole...)
>
> So I suggested to signal every X/2 and Marta reported SQ overflows for
> high queue-depth. Marta, at what queue-depth have you seen this?

I just want to clarify my comment about RQ - because you are talking
about a queue depth concept.

It is tempting to rely on a queue depth of X == SQ depth of X (or 2*
in this case) and then try to re-use the flow control on the queue to
protect the SQ from overflow.

Generally this does not work because the main queue can retire work
from either the CQ or the RQ. If the retire workload is mainly RQ
based then you can submit to the SQ without a CQ poll and overflow
it.

Generally I expect to see direct measurement of SQ capacity and plug
the main queue when the SQ goes full.

To check this, use a simple assertion: decrement a counter on every
post, increment it appropriately on every polled completion, initialise
it to the SQ depth, and assert the counter never goes negative.

Triggering the assertion is an unconditional ULP bug, and is the most
likely cause of a "no space" posting failure.
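
The accounting described above can be sketched as a simple credit counter; this is a userspace model with invented names, not actual driver code:

```c
#include <stdbool.h>

/*
 * Credit-based SQ accounting: initialise the counter to the SQ depth,
 * decrement it on every post, add back the number of send completions
 * reaped from the CQ, and treat a post attempted with no credits left
 * as a ULP bug.
 */
struct sq_credits {
	int credits;	/* initialise to the SQ depth */
};

/* Returns false instead of overflowing the SQ; hitting this is the bug. */
static bool sq_credits_post(struct sq_credits *sq)
{
	if (sq->credits <= 0)
		return false;
	sq->credits--;
	return true;
}

/* Call after polling the CQ, with the number of send completions seen. */
static void sq_credits_reap(struct sq_credits *sq, int completed)
{
	sq->credits += completed;
}
```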

Jason
Marta Rybczynska March 30, 2017, 2:23 p.m. UTC | #15
>>> You say above "we post *up to* 2 work requests", unless you wish to
>>> change that to "we always post at least 2 work requests per queue
>>> entry", Jason is right, your frequency of signaling needs to be X/2
>>> regardless of your CQ size, you need the signaling to control the queue
>>> depth tracking.
>>
>> If you would like to spread things out farther between signaling, then
>> you can modify your send routine to only increment the send counter for
>> actual send requests, ignoring registration WQEs and invalidate WQES,
>> and then signal every X/2 sends.
> 
> Yea, you're right, and not only did I get it wrong, I even contradicted my
> own suggestion, which was exactly what you and Jason suggested (where is
> the nearest rat-hole...)
>
> So I suggested to signal every X/2 and Marta reported SQ overflows for
> high queue-depth. Marta, at what queue-depth have you seen this?

The remote side had a queue depth of 16 or 32, and it's the WQ on the
initiator side that overflows (mlx5_wq_overflow). We're testing with
signalling at X/2 and it seems to work.

Marta
Marta Rybczynska April 6, 2017, 12:29 p.m. UTC | #16
>>>> You say above "we post *up to* 2 work requests", unless you wish to
>>>> change that to "we always post at least 2 work requests per queue
>>>> entry", Jason is right, your frequency of signaling needs to be X/2
>>>> regardless of your CQ size, you need the signaling to control the queue
>>>> depth tracking.
>>>
>>> If you would like to spread things out farther between signaling, then
>>> you can modify your send routine to only increment the send counter for
>>> actual send requests, ignoring registration WQEs and invalidate WQES,
>>> and then signal every X/2 sends.
>> 
>> Yea, you're right, and not only did I get it wrong, I even contradicted my
>> own suggestion, which was exactly what you and Jason suggested (where is
>> the nearest rat-hole...)
>>
>> So I suggested to signal every X/2 and Marta reported SQ overflows for
>> high queue-depth. Marta, at what queue-depth have you seen this?
> 
> The remote side had queue depth of 16 or 32 and that's the WQ on the
> initiator side that overflows (mlx5_wq_overflow). We're testing with
> signalling X/2 and it seems to work.

Update on the situation: the signalling on X/2 seems to work fine in
practice. To clarify: it's the send queue that overflows
(mlx5_wq_overflow in begin_wqe of drivers/infiniband/hw/mlx5/qp.c).

However, I still have doubts about how it's going to work in the case
of higher queue depths (i.e. the typical case). If we signal every X/2
we'll do it much less often than today (every 32 messages). I'm not
sure of the system effect this would have.

Mellanox guys, do you have an idea what it might do?

Marta
Leon Romanovsky April 6, 2017, 1:02 p.m. UTC | #17
On Thu, Apr 06, 2017 at 02:29:03PM +0200, Marta Rybczynska wrote:
> >>>> You say above "we post *up to* 2 work requests", unless you wish to
> >>>> change that to "we always post at least 2 work requests per queue
> >>>> entry", Jason is right, your frequency of signaling needs to be X/2
> >>>> regardless of your CQ size, you need the signaling to control the queue
> >>>> depth tracking.
> >>>
> >>> If you would like to spread things out farther between signaling, then
> >>> you can modify your send routine to only increment the send counter for
> >>> actual send requests, ignoring registration WQEs and invalidate WQES,
> >>> and then signal every X/2 sends.
> >>
> >> Yea, you're right, and not only I got it wrong, I even contradicted my
> >> own suggestion that was exactly what you and Jason suggested (where is
> >> the nearest rat-hole...)
> >>
> >> So I suggested to signal every X/2 and Marta reported SQ overflows for
> >> high queue-depth. Marta, at what queue-depth have you seen this?
> >
> > The remote side had queue depth of 16 or 32 and that's the WQ on the
> > initiator side that overflows (mlx5_wq_overflow). We're testing with
> > signalling X/2 and it seems to work.
>
> Update on the situation: the signalling on X/2 seems to work fine in
> practice. To clarify more that's the send queue that overflows
> (mlx5_wq_overflow in begin_wqe of drivers/infiniband/hw/mlx5/qp.c).
>
> However, I have still doubt how it's going to work in the case of
> higher queue depths (i.e. the typical case). If we signal every X/2
> we'll do it much more rarely than today (every 32 messages). I'm not
> sure on the system effect this would have.
>
> Mellanox guys, do you have an idea what it might do?

It will continue to work as expected with larger depths too.
All you need is to not forget to issue a signal when the queue is terminated.

Thanks

>
> Marta
Marta Rybczynska April 7, 2017, 1:31 p.m. UTC | #18
> On Thu, Apr 06, 2017 at 02:29:03PM +0200, Marta Rybczynska wrote:
>> >>>> You say above "we post *up to* 2 work requests", unless you wish to
>> >>>> change that to "we always post at least 2 work requests per queue
>> >>>> entry", Jason is right, your frequency of signaling needs to be X/2
>> >>>> regardless of your CQ size, you need the signaling to control the queue
>> >>>> depth tracking.
>> >>>
>> >>> If you would like to spread things out farther between signaling, then
>> >>> you can modify your send routine to only increment the send counter for
>> >>> actual send requests, ignoring registration WQEs and invalidate WQES,
>> >>> and then signal every X/2 sends.
>> >>
>> >> Yea, you're right, and not only I got it wrong, I even contradicted my
>> >> own suggestion that was exactly what you and Jason suggested (where is
>> >> the nearest rat-hole...)
>> >>
>> >> So I suggested to signal every X/2 and Marta reported SQ overflows for
>> >> high queue-depth. Marta, at what queue-depth have you seen this?
>> >
>> > The remote side had queue depth of 16 or 32 and that's the WQ on the
>> > initiator side that overflows (mlx5_wq_overflow). We're testing with
>> > signalling X/2 and it seems to work.
>>
>> Update on the situation: the signalling on X/2 seems to work fine in
>> practice. To clarify more that's the send queue that overflows
>> (mlx5_wq_overflow in begin_wqe of drivers/infiniband/hw/mlx5/qp.c).
>>
>> However, I have still doubt how it's going to work in the case of
>> higher queue depths (i.e. the typical case). If we signal every X/2
>> we'll do it much more rarely than today (every 32 messages). I'm not
>> sure on the system effect this would have.
>>
>> Mellanox guys, do you have an idea what it might do?
> 
> It will continue to work as expected with larger depths too.
> All you need is to not forget to issue a signal when the queue is terminated.
> 

Thanks Leon. I will then submit the v2.

Marta
Sagi Grimberg April 9, 2017, 12:31 p.m. UTC | #19
>>> Mellanox guys, do you have an idea what it might do?
>>
>> It will continue to work as expected with larger depths too.
>> All you need is to not forget to issue a signal when the queue is terminated.
>>
>
> Thanks Leon. I will then submit the v2.

Marta, can you please test with higher queue-depth? Say 512 and/or 1024?

Thanks,
Sagi.
Marta Rybczynska April 10, 2017, 11:29 a.m. UTC | #20
>>>> Mellanox guys, do you have an idea what it might do?
>>>
>>> It will continue to work as expected with larger depths too.
>>> All you need is to not forget to issue a signal when the queue is terminated.
>>>
>>
>> Thanks Leon. I will then submit the v2.
> 
> Marta, can you please test with higher queue-depth? Say 512 and/or 1024?
> 

Yes, we'll do that.

Marta

Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 779f516..8ea4cba 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1023,6 +1023,7 @@  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 {
        struct ib_send_wr wr, *bad_wr;
        int ret;
+       int sig_limit;
 
        sge->addr   = qe->dma;
        sge->length = sizeof(struct nvme_command),
@@ -1054,7 +1055,8 @@  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
         * embedded in request's payload, is not freed when __ib_process_cq()
         * calls wr_cqe->done().
         */
-       if ((++queue->sig_count % 32) == 0 || flush)
+       sig_limit = min(queue->queue_size, 32);
+       if ((++queue->sig_count % sig_limit) == 0 || flush)
                wr.send_flags |= IB_SEND_SIGNALED;
 
        if (first)
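
For comparison, the direction the thread converged on for v2 (signal every queue_size/2 rather than every min(queue_size, 32)) might look roughly like the following userspace model. This is a sketch, not the actual v2 patch; the clamp for queue sizes below 2 is an assumption added here to keep the modulo well-defined:

```c
#include <stdbool.h>

/*
 * Model of the discussed v2 behaviour: request a signalled send
 * completion every queue_size/2 posts instead of every
 * min(queue_size, 32) posts. The clamp to 1 guards the modulo for
 * hypothetical queue sizes below 2.
 */
static bool model_queue_sig_limit(int *sig_count, int queue_size)
{
	int limit = queue_size >= 2 ? queue_size / 2 : 1;

	return (++(*sig_count) % limit) == 0;
}
```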