Message ID | 1315914765.312051621.1490259849534.JavaMail.zimbra@kalray.eu (mailing list archive)
---|---
State | RFC
On Thu, Mar 23, 2017 at 10:04:09AM +0100, Marta Rybczynska wrote:
> In the case of small NVMe-oF queue size (<32) we may enter
> a deadlock caused by the fact that the IB completions aren't sent
> waiting for 32 and the send queue will fill up.
>
> The error is seen as (using mlx5):
> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
>
> The patch doesn't change the behaviour for remote devices with
> larger queues.

Thanks, this looks useful. But wouldn't it be better to do something
like queue_size divided by 2 or 4 to get a better refill latency?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
----- Mail original -----
> On Thu, Mar 23, 2017 at 10:04:09AM +0100, Marta Rybczynska wrote:
>> In the case of small NVMe-oF queue size (<32) we may enter
>> a deadlock caused by the fact that the IB completions aren't sent
>> waiting for 32 and the send queue will fill up.
>>
>> The error is seen as (using mlx5):
>> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
>> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
>>
>> The patch doesn't change the behaviour for remote devices with
>> larger queues.
>
> Thanks, this looks useful. But wouldn't it be better to do something
> like queue_size divided by 2 or 4 to get a better refill latency?

That's an interesting question. The max number of requests is already at
3 or 4 times the queue size because of the different message types (see
Sam's original message in 'NVMe RDMA driver: CX4 send queue fills up
when nvme queue depth is low'). I guess it would have influence on
configs with bigger latency.

I would like to have Sagi's view on this as he's the one who has changed
that part in the iSER initiator in
6df5a128f0fde6315a44e80b30412997147f5efd

Marta
>> Thanks, this looks useful. But wouldn't it be better to do something
>> like queue_size divided by 2 or 4 to get a better refill latency?
>
> That's an interesting question. The max number of requests is already at
> 3 or 4 times the queue size because of the different message types (see
> Sam's original message in 'NVMe RDMA driver: CX4 send queue fills up
> when nvme queue depth is low'). I guess it would have influence on
> configs with bigger latency.
>
> I would like to have Sagi's view on this as he's the one who has changed
> that part in the iSER initiator in
> 6df5a128f0fde6315a44e80b30412997147f5efd

Hi Marta,

This looks good indeed. IIRC both for IB and iWARP we need to signal at
least once every send-queue depth (in practice I remember that some
devices need more than once) so maybe it'll be good to have division by
2.

Maybe it'll be better if we do:

static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
{
	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
}

And lose the hard-coded 32 entirely. Care to test that?
----- Mail original -----
>>> Thanks, this looks useful. But wouldn't it be better to do something
>>> like queue_size divided by 2 or 4 to get a better refill latency?
>>
>> That's an interesting question. The max number of requests is already
>> at 3 or 4 times the queue size because of the different message types
>> (see Sam's original message in 'NVMe RDMA driver: CX4 send queue fills
>> up when nvme queue depth is low'). I guess it would have influence on
>> configs with bigger latency.
>>
>> I would like to have Sagi's view on this as he's the one who has
>> changed that part in the iSER initiator in
>> 6df5a128f0fde6315a44e80b30412997147f5efd
>
> Hi Marta,
>
> This looks good indeed. IIRC both for IB and iWARP we need to signal
> at least once every send-queue depth (in practice I remember that some
> devices need more than once) so maybe it'll be good to have division by
> 2.
>
> Maybe it'll be better if we do:
>
> static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
> {
> 	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
> }
>
> And lose the hard-coded 32 entirely. Care to test that?

Hello Sagi,
I agree with you, we've found a setup where signalling every queue
depth is not enough and we're testing the division by two, which seems
to work fine till now.

In your version, in the case of queue length > 32 the notifications
would be sent less often than they are now. I'm wondering if it will
have an impact on performance and internal card buffering (it seems
that Mellanox buffers are ~100 elements). Wouldn't it create issues?

I'd like to see the magic constant removed. From what I can see we
need to have something not exceeding the send buffer of the card but
also not lower than the queue depth. What do you think?

Marta
>> Maybe it'll be better if we do:
>>
>> static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
>> {
>> 	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
>> }
>>
>> And lose the hard-coded 32 entirely. Care to test that?
>
> Hello Sagi,
> I agree with you, we've found a setup where signalling every queue
> depth is not enough and we're testing the division by two, which seems
> to work fine till now.
>
> In your version, in the case of queue length > 32 the notifications
> would be sent less often than they are now. I'm wondering if it will
> have an impact on performance and internal card buffering (it seems
> that Mellanox buffers are ~100 elements). Wouldn't it create issues?
>
> I'd like to see the magic constant removed. From what I can see we
> need to have something not exceeding the send buffer of the card but
> also not lower than the queue depth. What do you think?

I'm not sure what buffering is needed from the device at all in this
case, the device is simply expected to avoid signaling completions.

Mellanox folks, any idea where is this limitation coming from?
Do we need a device capability for it?
>>> Maybe it'll be better if we do:
>>>
>>> static inline bool queue_sig_limit(struct nvme_rdma_queue *queue)
>>> {
>>> 	return (++queue->sig_count % (queue->queue_size / 2)) == 0;
>>> }
>>>
>>> And lose the hard-coded 32 entirely. Care to test that?
>>
>> Hello Sagi,
>> I agree with you, we've found a setup where signalling every queue
>> depth is not enough and we're testing the division by two, which seems
>> to work fine till now.
>>
>> In your version, in the case of queue length > 32 the notifications
>> would be sent less often than they are now. I'm wondering if it will
>> have an impact on performance and internal card buffering (it seems
>> that Mellanox buffers are ~100 elements). Wouldn't it create issues?
>>
>> I'd like to see the magic constant removed. From what I can see we
>> need to have something not exceeding the send buffer of the card but
>> also not lower than the queue depth. What do you think?
>
> I'm not sure what buffering is needed from the device at all in this
> case, the device is simply expected to avoid signaling completions.
>
> Mellanox folks, any idea where is this limitation coming from?
> Do we need a device capability for it?

In the case of mlx5 we're getting -ENOMEM from begin_wqe (the condition
on mlx5_wq_overflow). This queue is sized in the driver based on
multiple factors. If we ack less often, this could happen for higher
queue depths too, I think.

Marta
On Tue, Mar 28, 2017 at 02:30:14PM +0300, Sagi Grimberg wrote:
> I'm not sure what buffering is needed from the device at all in this
> case, the device is simply expected to avoid signaling completions.
>
> Mellanox folks, any idea where is this limitation coming from?
> Do we need a device capability for it?

Fundamentally you must drive SQ flow control via CQ completions. For
instance a ULP cannot disable all CQ notifications and keep stuffing
things into the SQ.

An alternative way to state this: A ULP cannot use activity on the RQ
to infer that there is space in the SQ. Only CQ completions can be used
to prove there is more available SQ space. Do not post to the SQ until
a CQ has been polled proving available space.

Ultimately you need a minimum of one CQ notification for every SQ depth
of posts, and the ULP must not post to the SQ once it fills until it
sees the CQ notification. That usually drives the rule of thumb to
notify every 1/2 depth; however, any SQ WQE posting failures indicate a
ULP bug.

There are a bunch of varied reasons for this, and it was discussed to
death for NFS. NFS's bugs and wonkyness in this area went away when
Chuck did strict accounting of SQ capacity driven by the CQ...

Jason
Jason,

> Fundamentally you must drive SQ flow control via CQ completions. For
> instance a ULP cannot disable all CQ notifications and keep stuffing
> things into the SQ.
>
> An alternative way to state this: A ULP cannot use activity on the RQ
> to infer that there is space in the SQ. Only CQ completions can be used
> to prove there is more available SQ space. Do not post to the SQ until
> a CQ has been polled proving available space.

The recv queue is out of the ball game here... We just selectively
signal send completions only to reduce some interrupts and completion
processing.

> Ultimately you need a minimum of one CQ notification for every SQ depth
> of posts, and the ULP must not post to the SQ once it fills until it
> sees the CQ notification. That usually drives the rule of thumb to
> notify every 1/2 depth; however, any SQ WQE posting failures indicate a
> ULP bug.

Well, usually, but I'm not convinced this is the case.

For each I/O we post up to 2 work requests, 1 for memory registration
and 1 for sending an I/O request (and 1 for local invalidation if the
target doesn't do it for us, but that is not the case here). So if our
queue depth is X, we size our completion queue to be X*3, and we need
to make sure we signal every (X*3)/2.

What I think Marta is seeing is that no matter how long we size our
send-queue (in other words, no matter how high X is) we're not able to
get away with suppressing send completion signaling for more than ~100
posts.

Please correct me if I'm wrong Marta?
On Wed, Mar 29, 2017 at 06:47:54PM +0300, Sagi Grimberg wrote:
> For each I/O we post up to 2 work requests, 1 for memory registration
> and 1 for sending an I/O request (and 1 for local invalidation if the
> target doesn't do it for us, but that is not the case here). So if our
> queue depth is X, we size our completion queue to be X*3, and we need
> to make sure we signal every (X*3)/2.

??? If your SQ is X and your CQ is X*3 you need to signal at X/2.

Jason
>> For each I/O we post up to 2 work requests, 1 for memory registration
>> and 1 for sending an I/O request (and 1 for local invalidation if the
>> target doesn't do it for us, but that is not the case here). So if our
>> queue depth is X, we size our completion queue to be X*3, and we need
>> to make sure we signal every (X*3)/2.
>
> ??? If your SQ is X and your CQ is X*3 you need to signal at X/2.

Sorry, I confused SQ with CQ (which made it even more confusing..)

Our application queue-depth is X, we size our SQ to be X*3
(send+reg+inv), we size our RQ to be X (resp) and our CQ to be
X*4 (SQ+RQ).

So we should signal every (X*3)/2
On 3/29/17 11:39 AM, Sagi Grimberg wrote:
>
>>>> For each I/O we post up to 2 work requests, 1 for memory registration
>>>> and 1 for sending an I/O request (and 1 for local invalidation if the
>>>> target doesn't do it for us, but that is not the case here). So if our
>>>> queue depth is X, we size our completion queue to be X*3, and we need
>>>> to make sure we signal every (X*3)/2.
>>>
>>> ??? If your SQ is X and your CQ is X*3 you need to signal at X/2.
>>
>> Sorry, I confused SQ with CQ (which made it even more confusing..)
>>
>> Our application queue-depth is X, we size our SQ to be X*3
>> (send+reg+inv), we size our RQ to be X (resp) and our CQ to be
>> X*4 (SQ+RQ).
>>
>> So we should signal every (X*3)/2

You say above "we post *up to* 2 work requests"; unless you wish to
change that to "we always post at least 2 work requests per queue
entry", Jason is right: your frequency of signaling needs to be X/2
regardless of your CQ size, because you need the signaling to control
the queue depth tracking.
On 3/29/17 11:44 AM, Doug Ledford wrote:
> On 3/29/17 11:39 AM, Sagi Grimberg wrote:
>>
>>>>> For each I/O we post up to 2 work requests, 1 for memory registration
>>>>> and 1 for sending an I/O request (and 1 for local invalidation if the
>>>>> target doesn't do it for us, but that is not the case here). So if our
>>>>> queue depth is X, we size our completion queue to be X*3, and we need
>>>>> to make sure we signal every (X*3)/2.
>>>>
>>>> ??? If your SQ is X and your CQ is X*3 you need to signal at X/2.
>>>
>>> Sorry, I confused SQ with CQ (which made it even more confusing..)
>>>
>>> Our application queue-depth is X, we size our SQ to be X*3
>>> (send+reg+inv), we size our RQ to be X (resp) and our CQ to be
>>> X*4 (SQ+RQ).
>>>
>>> So we should signal every (X*3)/2
>
> You say above "we post *up to* 2 work requests"; unless you wish to
> change that to "we always post at least 2 work requests per queue
> entry", Jason is right: your frequency of signaling needs to be X/2
> regardless of your CQ size, because you need the signaling to control
> the queue depth tracking.

If you would like to spread things out farther between signalings, then
you can modify your send routine to only increment the send counter for
actual send requests, ignoring registration WQEs and invalidate WQEs,
and then signal every X/2 sends.
>> You say above "we post *up to* 2 work requests"; unless you wish to
>> change that to "we always post at least 2 work requests per queue
>> entry", Jason is right: your frequency of signaling needs to be X/2
>> regardless of your CQ size, because you need the signaling to control
>> the queue depth tracking.
>
> If you would like to spread things out farther between signalings, then
> you can modify your send routine to only increment the send counter for
> actual send requests, ignoring registration WQEs and invalidate WQEs,
> and then signal every X/2 sends.

Yea, you're right, and not only did I get it wrong, I even contradicted
my own suggestion that was exactly what you and Jason suggested (where
is the nearest rat-hole...)

So I suggested to signal every X/2 and Marta reported SQ overflows for
high queue-depth. Marta, at what queue-depth have you seen this?
On Wed, Mar 29, 2017 at 07:59:13PM +0300, Sagi Grimberg wrote:
> Yea, you're right, and not only did I get it wrong, I even contradicted
> my own suggestion that was exactly what you and Jason suggested (where
> is the nearest rat-hole...)
>
> So I suggested to signal every X/2 and Marta reported SQ overflows for
> high queue-depth. Marta, at what queue-depth have you seen this?

I just want to clarify my comment about the RQ - because you are talking
about a queue depth concept. It is tempting to rely on a queue depth of
X == SQ depth of X (or 2*X in this case) and then try to re-use the flow
control on the queue to protect the SQ from overflow.

Generally this does not work because the main queue can retire work from
either the CQ or the RQ. If the retire workload is mainly RQ based then
you can submit to the SQ without a CQ poll and overflow it.

Generally I expect to see direct measurement of SQ capacity, plugging
the main queue when the SQ goes full.

To check this, use a simple assertion: decrement a counter on every
post, increment it appropriately on every polled CQ completion, init it
to the SQ depth, and assert the counter never goes negative. Triggering
the assertion is an unconditional ULP bug, and is the most likely cause
of a posting no-space failure.

Jason
>>> You say above "we post *up to* 2 work requests"; unless you wish to
>>> change that to "we always post at least 2 work requests per queue
>>> entry", Jason is right: your frequency of signaling needs to be X/2
>>> regardless of your CQ size, because you need the signaling to control
>>> the queue depth tracking.
>>
>> If you would like to spread things out farther between signalings,
>> then you can modify your send routine to only increment the send
>> counter for actual send requests, ignoring registration WQEs and
>> invalidate WQEs, and then signal every X/2 sends.
>
> Yea, you're right, and not only did I get it wrong, I even contradicted
> my own suggestion that was exactly what you and Jason suggested (where
> is the nearest rat-hole...)
>
> So I suggested to signal every X/2 and Marta reported SQ overflows for
> high queue-depth. Marta, at what queue-depth have you seen this?

The remote side had a queue depth of 16 or 32, and it's the WQ on the
initiator side that overflows (mlx5_wq_overflow). We're testing with
signalling X/2 and it seems to work.

Marta
>>>> You say above "we post *up to* 2 work requests"; unless you wish to
>>>> change that to "we always post at least 2 work requests per queue
>>>> entry", Jason is right: your frequency of signaling needs to be X/2
>>>> regardless of your CQ size, because you need the signaling to
>>>> control the queue depth tracking.
>>>
>>> If you would like to spread things out farther between signalings,
>>> then you can modify your send routine to only increment the send
>>> counter for actual send requests, ignoring registration WQEs and
>>> invalidate WQEs, and then signal every X/2 sends.
>>
>> Yea, you're right, and not only did I get it wrong, I even contradicted
>> my own suggestion that was exactly what you and Jason suggested (where
>> is the nearest rat-hole...)
>>
>> So I suggested to signal every X/2 and Marta reported SQ overflows for
>> high queue-depth. Marta, at what queue-depth have you seen this?
>
> The remote side had a queue depth of 16 or 32, and it's the WQ on the
> initiator side that overflows (mlx5_wq_overflow). We're testing with
> signalling X/2 and it seems to work.

Update on the situation: the signalling on X/2 seems to work fine in
practice. To clarify more, it's the send queue that overflows
(mlx5_wq_overflow in begin_wqe of drivers/infiniband/hw/mlx5/qp.c).

However, I still have doubts about how it's going to work in the case of
higher queue depths (i.e. the typical case). If we signal every X/2
we'll do it much more rarely than today (every 32 messages). I'm not
sure of the system effect this would have.

Mellanox guys, do you have an idea what it might do?

Marta
On Thu, Apr 06, 2017 at 02:29:03PM +0200, Marta Rybczynska wrote:
> >>>> You say above "we post *up to* 2 work requests"; unless you wish
> >>>> to change that to "we always post at least 2 work requests per
> >>>> queue entry", Jason is right: your frequency of signaling needs to
> >>>> be X/2 regardless of your CQ size, because you need the signaling
> >>>> to control the queue depth tracking.
> >>>
> >>> If you would like to spread things out farther between signalings,
> >>> then you can modify your send routine to only increment the send
> >>> counter for actual send requests, ignoring registration WQEs and
> >>> invalidate WQEs, and then signal every X/2 sends.
> >>
> >> Yea, you're right, and not only did I get it wrong, I even
> >> contradicted my own suggestion that was exactly what you and Jason
> >> suggested (where is the nearest rat-hole...)
> >>
> >> So I suggested to signal every X/2 and Marta reported SQ overflows
> >> for high queue-depth. Marta, at what queue-depth have you seen this?
> >
> > The remote side had a queue depth of 16 or 32, and it's the WQ on the
> > initiator side that overflows (mlx5_wq_overflow). We're testing with
> > signalling X/2 and it seems to work.
>
> Update on the situation: the signalling on X/2 seems to work fine in
> practice. To clarify more, it's the send queue that overflows
> (mlx5_wq_overflow in begin_wqe of drivers/infiniband/hw/mlx5/qp.c).
>
> However, I still have doubts about how it's going to work in the case
> of higher queue depths (i.e. the typical case). If we signal every X/2
> we'll do it much more rarely than today (every 32 messages). I'm not
> sure of the system effect this would have.
>
> Mellanox guys, do you have an idea what it might do?

It will continue to work as expected with long depths too. All that you
need is not to forget to issue a signal when the queue is terminated.

Thanks
> On Thu, Apr 06, 2017 at 02:29:03PM +0200, Marta Rybczynska wrote:
>> Update on the situation: the signalling on X/2 seems to work fine in
>> practice. To clarify more, it's the send queue that overflows
>> (mlx5_wq_overflow in begin_wqe of drivers/infiniband/hw/mlx5/qp.c).
>>
>> However, I still have doubts about how it's going to work in the case
>> of higher queue depths (i.e. the typical case). If we signal every X/2
>> we'll do it much more rarely than today (every 32 messages). I'm not
>> sure of the system effect this would have.
>>
>> Mellanox guys, do you have an idea what it might do?
>
> It will continue to work as expected with long depths too. All that you
> need is not to forget to issue a signal when the queue is terminated.

Thanks Leon. I will then submit the v2.

Marta
>>> Mellanox guys, do you have an idea what it might do?
>>
>> It will continue to work as expected with long depths too. All that
>> you need is not to forget to issue a signal when the queue is
>> terminated.
>
> Thanks Leon. I will then submit the v2.

Marta, can you please test with higher queue-depth? Say 512 and/or 1024?

Thanks,
Sagi.
>>>> Mellanox guys, do you have an idea what it might do?
>>>
>>> It will continue to work as expected with long depths too. All that
>>> you need is not to forget to issue a signal when the queue is
>>> terminated.
>>
>> Thanks Leon. I will then submit the v2.
>
> Marta, can you please test with higher queue-depth? Say 512 and/or 1024?

Yes, we'll do that.

Marta
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 779f516..8ea4cba 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1023,6 +1023,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 {
 	struct ib_send_wr wr, *bad_wr;
 	int ret;
+	int sig_limit;
 
 	sge->addr = qe->dma;
 	sge->length = sizeof(struct nvme_command),
@@ -1054,7 +1055,8 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	 * embedded in request's payload, is not freed when __ib_process_cq()
 	 * calls wr_cqe->done().
 	 */
-	if ((++queue->sig_count % 32) == 0 || flush)
+	sig_limit = min(queue->queue_size, 32);
+	if ((++queue->sig_count % sig_limit) == 0 || flush)
 		wr.send_flags |= IB_SEND_SIGNALED;
 
 	if (first)