diff mbox series

[for-rc] siw: MPA Reply handler tries to read beyond MPA message

Message ID 20190731103310.23199-1-krishna2@chelsio.com (mailing list archive)
State Accepted
Headers show
Series [for-rc] siw: MPA Reply handler tries to read beyond MPA message | expand

Commit Message

Krishnamraju Eraparaju July 31, 2019, 10:33 a.m. UTC
while processing MPA Reply, SIW driver is trying to read extra 4 bytes
than what peer has advertised as private data length.

If a FPDU data is received before even siw_recv_mpa_rr() completed
reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could also
read FPDU, if "size" is larger than advertised MPA reply length.

 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
 502 {
          .............
 572
 573         if (rcvd > to_rcv)
 574                 return -EPROTO;   <----- Failure here

Looks like the intention here is to throw an ERROR if the received data
is more than the total private data length advertised by the peer. But
reading beyond MPA message causes siw_cm to generate
RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is already
queued with FPDU messages.

Hence, this function should only read upto private data length.

Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
---
 drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Doug Ledford July 31, 2019, 7:17 p.m. UTC | #1
On Wed, 2019-07-31 at 16:03 +0530, Krishnamraju Eraparaju wrote:
> while processing MPA Reply, SIW driver is trying to read extra 4 bytes
> than what peer has advertised as private data length.
> 
> If a FPDU data is received before even siw_recv_mpa_rr() completed
> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could also
> read FPDU, if "size" is larger than advertised MPA reply length.
> 
>  501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>  502 {
>           .............
>  572
>  573         if (rcvd > to_rcv)
>  574                 return -EPROTO;   <----- Failure here
> 
> Looks like the intention here is to throw an ERROR if the received
> data
> is more than the total private data length advertised by the peer. But
> reading beyond MPA message causes siw_cm to generate
> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is
> already
> queued with FPDU messages.
> 
> Hence, this function should only read upto private data length.
> 
> Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>

Once you apply this patch, the if (rcvd > to_rcv) test you listed above
in the commit message becomes dead code.  So I removed it while applying
the patch.  Thanks.
Bernard Metzler Aug. 1, 2019, 10:56 a.m. UTC | #2
-----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----

>To: jgg@ziepe.ca, bmt@zurich.ibm.com
>From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>Date: 07/31/2019 12:34PM
>Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju Eraparaju"
><krishna2@chelsio.com>
>Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler tries to
>read beyond MPA message
>
>while processing MPA Reply, SIW driver is trying to read extra 4
>bytes
>than what peer has advertised as private data length.
>
>If a FPDU data is received before even siw_recv_mpa_rr() completed
>reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could also
>read FPDU, if "size" is larger than advertised MPA reply length.
>
> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
> 502 {
>          .............
> 572
> 573         if (rcvd > to_rcv)
> 574                 return -EPROTO;   <----- Failure here
>
>Looks like the intention here is to throw an ERROR if the received
>data
>is more than the total private data length advertised by the peer.
>But
>reading beyond MPA message causes siw_cm to generate
>RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is
>already
>queued with FPDU messages.
>
>Hence, this function should only read upto private data length.
>
>Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
>---
> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>b/drivers/infiniband/sw/siw/siw_cm.c
>index a7cde98e73e8..8dc8cea2566c 100644
>--- a/drivers/infiniband/sw/siw/siw_cm.c
>+++ b/drivers/infiniband/sw/siw/siw_cm.c
>@@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
> 	 * A private data buffer gets allocated if hdr->params.pd_len != 0.
> 	 */
> 	if (!cep->mpa.pdata) {
>-		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>+		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
> 		if (!cep->mpa.pdata)
> 			return -ENOMEM;
> 	}
> 	rcvd = ksock_recv(
> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct mpa_rr),
>-		to_rcv + 4, MSG_DONTWAIT);
>+		to_rcv, MSG_DONTWAIT);
> 
> 	if (rcvd < 0)
> 		return rcvd;
>-- 
>2.23.0.rc0
>
>

The intention of this code is to make sure the
peer does not violates the MPA handshake rules.
The initiator MUST NOT send extra data after its
MPA request and before receiving the MPA reply.
So, for the MPA request case, this code is needed
to check for protocol correctness.
You are right for the MPA reply case - if we are
_not_ in peer2peer mode, the peer can immediately
start sending data in RDMA mode after its MPA Reply.
So we shall add appropriate code to be more specific
For an error, we are (1) processing an MPA Request,
OR (2) processing an MPA Reply AND we are not expected
to send an initial READ/WRITE/Send as negotiated with
the peer (peer2peer mode MPA handshake).

Just removing this check would make siw more permissive,
but to a point where peer MPA protocol errors are
tolerated. I am not in favor of that level of
forgiveness.

If possible, please provide an appropriate patch
or (if it causes current issues with another peer
iWarp implementation) just run in MPA peer2peer mode,
where the current check is appropriate.
Otherwise, I would provide an appropriate fix by Monday
(I am still out of office this week).


Many thanks and best regards,
Bernard.
Krishnamraju Eraparaju Aug. 1, 2019, 11:34 a.m. UTC | #3
On Wednesday, July 07/31/19, 2019 at 15:17:40 -0400, Doug Ledford wrote:
> On Wed, 2019-07-31 at 16:03 +0530, Krishnamraju Eraparaju wrote:
> > while processing MPA Reply, SIW driver is trying to read extra 4 bytes
> > than what peer has advertised as private data length.
> > 
> > If a FPDU data is received before even siw_recv_mpa_rr() completed
> > reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could also
> > read FPDU, if "size" is larger than advertised MPA reply length.
> > 
> >  501 static int siw_recv_mpa_rr(struct siw_cep *cep)
> >  502 {
> >           .............
> >  572
> >  573         if (rcvd > to_rcv)
> >  574                 return -EPROTO;   <----- Failure here
> > 
> > Looks like the intention here is to throw an ERROR if the received
> > data
> > is more than the total private data length advertised by the peer. But
> > reading beyond MPA message causes siw_cm to generate
> > RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is
> > already
> > queued with FPDU messages.
> > 
> > Hence, this function should only read upto private data length.
> > 
> > Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
> 
> Once you apply this patch, the if (rcvd > to_rcv) test you listed above
> in the commit message becomes dead code.  So I removed it while applying
> the patch.  Thanks.
> 

Thanks Doug.

> -- 
> Doug Ledford <dledford@redhat.com>
>     GPG KeyID: B826A3330E572FDD
>     Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
Doug Ledford Aug. 1, 2019, 3:36 p.m. UTC | #4
On Thu, 2019-08-01 at 10:56 +0000, Bernard Metzler wrote:
> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
> 
> > To: jgg@ziepe.ca, bmt@zurich.ibm.com
> > From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> > Date: 07/31/2019 12:34PM
> > Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
> > nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju Eraparaju"
> > <krishna2@chelsio.com>
> > Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler tries to
> > read beyond MPA message
> > 
> > while processing MPA Reply, SIW driver is trying to read extra 4
> > bytes
> > than what peer has advertised as private data length.
> > 
> > If a FPDU data is received before even siw_recv_mpa_rr() completed
> > reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could also
> > read FPDU, if "size" is larger than advertised MPA reply length.
> > 
> > 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
> > 502 {
> >          .............
> > 572
> > 573         if (rcvd > to_rcv)
> > 574                 return -EPROTO;   <----- Failure here
> > 
> > Looks like the intention here is to throw an ERROR if the received
> > data
> > is more than the total private data length advertised by the peer.
> > But
> > reading beyond MPA message causes siw_cm to generate
> > RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is
> > already
> > queued with FPDU messages.
> > 
> > Hence, this function should only read upto private data length.
> > 
> > Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
> > ---
> > drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> > b/drivers/infiniband/sw/siw/siw_cm.c
> > index a7cde98e73e8..8dc8cea2566c 100644
> > --- a/drivers/infiniband/sw/siw/siw_cm.c
> > +++ b/drivers/infiniband/sw/siw/siw_cm.c
> > @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct siw_cep
> > *cep)
> > 	 * A private data buffer gets allocated if hdr->params.pd_len !=
> > 0.
> > 	 */
> > 	if (!cep->mpa.pdata) {
> > -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
> > +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
> > 		if (!cep->mpa.pdata)
> > 			return -ENOMEM;
> > 	}
> > 	rcvd = ksock_recv(
> > 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
> > mpa_rr),
> > -		to_rcv + 4, MSG_DONTWAIT);
> > +		to_rcv, MSG_DONTWAIT);
> > 
> > 	if (rcvd < 0)
> > 		return rcvd;
> > -- 
> > 2.23.0.rc0
> > 
> > 
> 
> The intention of this code is to make sure the
> peer does not violates the MPA handshake rules.
> The initiator MUST NOT send extra data after its
> MPA request and before receiving the MPA reply.
> So, for the MPA request case, this code is needed
> to check for protocol correctness.
> You are right for the MPA reply case - if we are
> _not_ in peer2peer mode, the peer can immediately
> start sending data in RDMA mode after its MPA Reply.
> So we shall add appropriate code to be more specific
> For an error, we are (1) processing an MPA Request,
> OR (2) processing an MPA Reply AND we are not expected
> to send an initial READ/WRITE/Send as negotiated with
> the peer (peer2peer mode MPA handshake).
> 
> Just removing this check would make siw more permissive,
> but to a point where peer MPA protocol errors are
> tolerated. I am not in favor of that level of
> forgiveness.
> 
> If possible, please provide an appropriate patch
> or (if it causes current issues with another peer
> iWarp implementation) just run in MPA peer2peer mode,
> where the current check is appropriate.
> Otherwise, I would provide an appropriate fix by Monday
> (I am still out of office this week).
> 
> 
> Many thanks and best regards,
> Bernard.
> 

I had taken the patch, but I've since dropped it (it had only made it to
my wip/dl-for-rc branch).  I'll wait for the proper fix (I looked, and I
could code it up, but I don't know what tests to use in the conditionals
since I don't know how you track peer2peer mode on the connection).
Tom Talpey Aug. 1, 2019, 6:53 p.m. UTC | #5
On 8/1/2019 6:56 AM, Bernard Metzler wrote:
> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
> 
>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> Date: 07/31/2019 12:34PM
>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju Eraparaju"
>> <krishna2@chelsio.com>
>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler tries to
>> read beyond MPA message
>>
>> while processing MPA Reply, SIW driver is trying to read extra 4
>> bytes
>> than what peer has advertised as private data length.
>>
>> If a FPDU data is received before even siw_recv_mpa_rr() completed
>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could also
>> read FPDU, if "size" is larger than advertised MPA reply length.
>>
>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>> 502 {
>>           .............
>> 572
>> 573         if (rcvd > to_rcv)
>> 574                 return -EPROTO;   <----- Failure here
>>
>> Looks like the intention here is to throw an ERROR if the received
>> data
>> is more than the total private data length advertised by the peer.
>> But
>> reading beyond MPA message causes siw_cm to generate
>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is
>> already
>> queued with FPDU messages.
>>
>> Hence, this function should only read upto private data length.
>>
>> Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
>> ---
>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>> b/drivers/infiniband/sw/siw/siw_cm.c
>> index a7cde98e73e8..8dc8cea2566c 100644
>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
>> 	 * A private data buffer gets allocated if hdr->params.pd_len != 0.
>> 	 */
>> 	if (!cep->mpa.pdata) {
>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>> 		if (!cep->mpa.pdata)
>> 			return -ENOMEM;
>> 	}
>> 	rcvd = ksock_recv(
>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct mpa_rr),
>> -		to_rcv + 4, MSG_DONTWAIT);
>> +		to_rcv, MSG_DONTWAIT);
>>
>> 	if (rcvd < 0)
>> 		return rcvd;
>> -- 
>> 2.23.0.rc0
>>
>>
> 
> The intention of this code is to make sure the
> peer does not violates the MPA handshake rules.
> The initiator MUST NOT send extra data after its
> MPA request and before receiving the MPA reply.

I think this is true only for MPA v2. With MPA v1, the
initiator can begin sending immediately (before receiving
the MPA reply), because there is no actual negotiation at
the MPA layer.

With MPA v2, the negotiation exchange is required because
the type of the following message is predicated by the
"Enhanced mode" a|b|c|d flags present in the first 32 bits
of the private data buffer.

So, it seems to me that additional logic is needed here to
determine the MPA version, before sniffing additional octets
from the incoming stream?

Tom.


> So, for the MPA request case, this code is needed
> to check for protocol correctness.
> You are right for the MPA reply case - if we are
> _not_ in peer2peer mode, the peer can immediately
> start sending data in RDMA mode after its MPA Reply.
> So we shall add appropriate code to be more specific
> For an error, we are (1) processing an MPA Request,
> OR (2) processing an MPA Reply AND we are not expected
> to send an initial READ/WRITE/Send as negotiated with
> the peer (peer2peer mode MPA handshake).
> 
> Just removing this check would make siw more permissive,
> but to a point where peer MPA protocol errors are
> tolerated. I am not in favor of that level of
> forgiveness.
> 
> If possible, please provide an appropriate patch
> or (if it causes current issues with another peer
> iWarp implementation) just run in MPA peer2peer mode,
> where the current check is appropriate.
> Otherwise, I would provide an appropriate fix by Monday
> (I am still out of office this week).
> 
> 
> Many thanks and best regards,
> Bernard.
> 
> 
>
Bernard Metzler Aug. 2, 2019, 11:18 a.m. UTC | #6
-----"Tom Talpey" <tom@talpey.com> wrote: -----

>To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju Eraparaju"
><krishna2@chelsio.com>
>From: "Tom Talpey" <tom@talpey.com>
>Date: 08/01/2019 08:54PM
>Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org, bharat@chelsio.com,
>nirranjan@chelsio.com, krishn2@chelsio.com
>Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>to read beyond MPA message
>
>On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>> 
>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>> Date: 07/31/2019 12:34PM
>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>Eraparaju"
>>> <krishna2@chelsio.com>
>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler tries to
>>> read beyond MPA message
>>>
>>> while processing MPA Reply, SIW driver is trying to read extra 4
>>> bytes
>>> than what peer has advertised as private data length.
>>>
>>> If a FPDU data is received before even siw_recv_mpa_rr() completed
>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could
>also
>>> read FPDU, if "size" is larger than advertised MPA reply length.
>>>
>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>>> 502 {
>>>           .............
>>> 572
>>> 573         if (rcvd > to_rcv)
>>> 574                 return -EPROTO;   <----- Failure here
>>>
>>> Looks like the intention here is to throw an ERROR if the received
>>> data
>>> is more than the total private data length advertised by the peer.
>>> But
>>> reading beyond MPA message causes siw_cm to generate
>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is
>>> already
>>> queued with FPDU messages.
>>>
>>> Hence, this function should only read upto private data length.
>>>
>>> Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
>>> ---
>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>> index a7cde98e73e8..8dc8cea2566c 100644
>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct siw_cep
>*cep)
>>> 	 * A private data buffer gets allocated if hdr->params.pd_len !=
>0.
>>> 	 */
>>> 	if (!cep->mpa.pdata) {
>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>>> 		if (!cep->mpa.pdata)
>>> 			return -ENOMEM;
>>> 	}
>>> 	rcvd = ksock_recv(
>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct mpa_rr),
>>> -		to_rcv + 4, MSG_DONTWAIT);
>>> +		to_rcv, MSG_DONTWAIT);
>>>
>>> 	if (rcvd < 0)
>>> 		return rcvd;
>>> -- 
>>> 2.23.0.rc0
>>>
>>>
>> 
>> The intention of this code is to make sure the
>> peer does not violates the MPA handshake rules.
>> The initiator MUST NOT send extra data after its
>> MPA request and before receiving the MPA reply.
>
>I think this is true only for MPA v2. With MPA v1, the
>initiator can begin sending immediately (before receiving
>the MPA reply), because there is no actual negotiation at
>the MPA layer.
>
>With MPA v2, the negotiation exchange is required because
>the type of the following message is predicated by the
>"Enhanced mode" a|b|c|d flags present in the first 32 bits
>of the private data buffer.
>
>So, it seems to me that additional logic is needed here to
>determine the MPA version, before sniffing additional octets
>from the incoming stream?
>
>Tom.
>

There still is the marker negotiation taking place.
RFC 5044 says in section 7.1.2:

"Note: Since the receiver's ability to deal with Markers is
 unknown until the Request and Reply Frames have been
 received, sending FPDUs before this occurs is not possible."

This section further discusses the responder's behavior,
where it MUST receive a first FPDU from the initiator
before sending its first FPDU:

"4.  MPA Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers."

So it appears with MPA version 1, the current siw
code is correct. The initiator is entering FPDU mode
first, and only after receiving the MPA reply frame.
Only after the initiator sent a first FPDU, the responder
can start using the connection in FPDU mode.
Because of this somehow broken connection establishment
procedure (only the initiator can start sending data), a
later MPA version makes this first FPDU exchange explicit
and selectable (zero length READ/WRITE/Send).

Bernard.

>
>> So, for the MPA request case, this code is needed
>> to check for protocol correctness.
>> You are right for the MPA reply case - if we are
>> _not_ in peer2peer mode, the peer can immediately
>> start sending data in RDMA mode after its MPA Reply.
>> So we shall add appropriate code to be more specific
>> For an error, we are (1) processing an MPA Request,
>> OR (2) processing an MPA Reply AND we are not expected
>> to send an initial READ/WRITE/Send as negotiated with
>> the peer (peer2peer mode MPA handshake).
>> 
>> Just removing this check would make siw more permissive,
>> but to a point where peer MPA protocol errors are
>> tolerated. I am not in favor of that level of
>> forgiveness.
>> 
>> If possible, please provide an appropriate patch
>> or (if it causes current issues with another peer
>> iWarp implementation) just run in MPA peer2peer mode,
>> where the current check is appropriate.
>> Otherwise, I would provide an appropriate fix by Monday
>> (I am still out of office this week).
>> 
>> 
>> Many thanks and best regards,
>> Bernard.
>> 
>> 
>> 
>
>
Tom Talpey Aug. 2, 2019, 12:47 p.m. UTC | #7
On 8/2/2019 7:18 AM, Bernard Metzler wrote:
> -----"Tom Talpey" <tom@talpey.com> wrote: -----
> 
>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju Eraparaju"
>> <krishna2@chelsio.com>
>> From: "Tom Talpey" <tom@talpey.com>
>> Date: 08/01/2019 08:54PM
>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org, bharat@chelsio.com,
>> nirranjan@chelsio.com, krishn2@chelsio.com
>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>> to read beyond MPA message
>>
>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>>>
>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>> Date: 07/31/2019 12:34PM
>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>> Eraparaju"
>>>> <krishna2@chelsio.com>
>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler tries to
>>>> read beyond MPA message
>>>>
>>>> while processing MPA Reply, SIW driver is trying to read extra 4
>>>> bytes
>>>> than what peer has advertised as private data length.
>>>>
>>>> If a FPDU data is received before even siw_recv_mpa_rr() completed
>>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() could
>> also
>>>> read FPDU, if "size" is larger than advertised MPA reply length.
>>>>
>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>>>> 502 {
>>>>            .............
>>>> 572
>>>> 573         if (rcvd > to_rcv)
>>>> 574                 return -EPROTO;   <----- Failure here
>>>>
>>>> Looks like the intention here is to throw an ERROR if the received
>>>> data
>>>> is more than the total private data length advertised by the peer.
>>>> But
>>>> reading beyond MPA message causes siw_cm to generate
>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer is
>>>> already
>>>> queued with FPDU messages.
>>>>
>>>> Hence, this function should only read upto private data length.
>>>>
>>>> Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
>>>> ---
>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>> index a7cde98e73e8..8dc8cea2566c 100644
>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct siw_cep
>> *cep)
>>>> 	 * A private data buffer gets allocated if hdr->params.pd_len !=
>> 0.
>>>> 	 */
>>>> 	if (!cep->mpa.pdata) {
>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>>>> 		if (!cep->mpa.pdata)
>>>> 			return -ENOMEM;
>>>> 	}
>>>> 	rcvd = ksock_recv(
>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct mpa_rr),
>>>> -		to_rcv + 4, MSG_DONTWAIT);
>>>> +		to_rcv, MSG_DONTWAIT);
>>>>
>>>> 	if (rcvd < 0)
>>>> 		return rcvd;
>>>> -- 
>>>> 2.23.0.rc0
>>>>
>>>>
>>>
>>> The intention of this code is to make sure the
>>> peer does not violates the MPA handshake rules.
>>> The initiator MUST NOT send extra data after its
>>> MPA request and before receiving the MPA reply.
>>
>> I think this is true only for MPA v2. With MPA v1, the
>> initiator can begin sending immediately (before receiving
>> the MPA reply), because there is no actual negotiation at
>> the MPA layer.
>>
>> With MPA v2, the negotiation exchange is required because
>> the type of the following message is predicated by the
>> "Enhanced mode" a|b|c|d flags present in the first 32 bits
>> of the private data buffer.
>>
>> So, it seems to me that additional logic is needed here to
>> determine the MPA version, before sniffing additional octets
>>from the incoming stream?
>>
>> Tom.
>>
> 
> There still is the marker negotiation taking place.
> RFC 5044 says in section 7.1.2:
> 
> "Note: Since the receiver's ability to deal with Markers is
>   unknown until the Request and Reply Frames have been
>   received, sending FPDUs before this occurs is not possible."
> 
> This section further discusses the responder's behavior,
> where it MUST receive a first FPDU from the initiator
> before sending its first FPDU:
> 
> "4.  MPA Responder mode implementations MUST receive and validate at
>         least one FPDU before sending any FPDUs or Markers."
> 
> So it appears with MPA version 1, the current siw
> code is correct. The initiator is entering FPDU mode
> first, and only after receiving the MPA reply frame.
> Only after the initiator sent a first FPDU, the responder
> can start using the connection in FPDU mode.
> Because of this somehow broken connection establishment
> procedure (only the initiator can start sending data), a
> later MPA version makes this first FPDU exchange explicit
> and selectable (zero length READ/WRITE/Send).

Yeah, I guess so. Because nobody ever actually implemented markers,
I think that they may more or less passively ignore this. But you're
currect that it's invalid protocol behavior.

If your testing didn't uncover any issues with existing implementations
failing to connect with your strict checking, I'm ok with it.

Tom.



> 
> Bernard.
> 
>>
>>> So, for the MPA request case, this code is needed
>>> to check for protocol correctness.
>>> You are right for the MPA reply case - if we are
>>> _not_ in peer2peer mode, the peer can immediately
>>> start sending data in RDMA mode after its MPA Reply.
>>> So we shall add appropriate code to be more specific
>>> For an error, we are (1) processing an MPA Request,
>>> OR (2) processing an MPA Reply AND we are not expected
>>> to send an initial READ/WRITE/Send as negotiated with
>>> the peer (peer2peer mode MPA handshake).
>>>
>>> Just removing this check would make siw more permissive,
>>> but to a point where peer MPA protocol errors are
>>> tolerated. I am not in favor of that level of
>>> forgiveness.
>>>
>>> If possible, please provide an appropriate patch
>>> or (if it causes current issues with another peer
>>> iWarp implementation) just run in MPA peer2peer mode,
>>> where the current check is appropriate.
>>> Otherwise, I would provide an appropriate fix by Monday
>>> (I am still out of office this week).
>>>
>>>
>>> Many thanks and best regards,
>>> Bernard.
>>>
>>>
>>>
>>
>>
> 
> 
>
Bernard Metzler Aug. 8, 2019, 3:05 p.m. UTC | #8
-----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----

>To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
>From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>Date: 08/05/2019 07:26PM
>Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
><linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
><bharat@chelsio.com>, "Nirranjan Kirubaharan" <nirranjan@chelsio.com>
>Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>to read beyond MPA message
>
>On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey wrote:
>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
>> > -----"Tom Talpey" <tom@talpey.com> wrote: -----
>> > 
>> >> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>Eraparaju"
>> >> <krishna2@chelsio.com>
>> >> From: "Tom Talpey" <tom@talpey.com>
>> >> Date: 08/01/2019 08:54PM
>> >> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
>bharat@chelsio.com,
>> >> nirranjan@chelsio.com, krishn2@chelsio.com
>> >> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>tries
>> >> to read beyond MPA message
>> >>
>> >> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>> >>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>-----
>> >>>
>> >>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>> >>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> >>>> Date: 07/31/2019 12:34PM
>> >>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>> >>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>> >> Eraparaju"
>> >>>> <krishna2@chelsio.com>
>> >>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
>tries to
>> >>>> read beyond MPA message
>> >>>>
>> >>>> while processing MPA Reply, SIW driver is trying to read extra
>4
>> >>>> bytes
>> >>>> than what peer has advertised as private data length.
>> >>>>
>> >>>> If a FPDU data is received before even siw_recv_mpa_rr()
>completed
>> >>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
>could
>> >> also
>> >>>> read FPDU, if "size" is larger than advertised MPA reply
>length.
>> >>>>
>> >>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>> >>>> 502 {
>> >>>>            .............
>> >>>> 572
>> >>>> 573         if (rcvd > to_rcv)
>> >>>> 574                 return -EPROTO;   <----- Failure here
>> >>>>
>> >>>> Looks like the intention here is to throw an ERROR if the
>received
>> >>>> data
>> >>>> is more than the total private data length advertised by the
>peer.
>> >>>> But
>> >>>> reading beyond MPA message causes siw_cm to generate
>> >>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer
>is
>> >>>> already
>> >>>> queued with FPDU messages.
>> >>>>
>> >>>> Hence, this function should only read upto private data
>length.
>> >>>>
>> >>>> Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
>> >>>> ---
>> >>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>> >>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> >>>>
>> >>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>> >>>> b/drivers/infiniband/sw/siw/siw_cm.c
>> >>>> index a7cde98e73e8..8dc8cea2566c 100644
>> >>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>> >>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>> >>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
>siw_cep
>> >> *cep)
>> >>>> 	 * A private data buffer gets allocated if hdr->params.pd_len
>!=
>> >> 0.
>> >>>> 	 */
>> >>>> 	if (!cep->mpa.pdata) {
>> >>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>> >>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>> >>>> 		if (!cep->mpa.pdata)
>> >>>> 			return -ENOMEM;
>> >>>> 	}
>> >>>> 	rcvd = ksock_recv(
>> >>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
>mpa_rr),
>> >>>> -		to_rcv + 4, MSG_DONTWAIT);
>> >>>> +		to_rcv, MSG_DONTWAIT);
>> >>>>
>> >>>> 	if (rcvd < 0)
>> >>>> 		return rcvd;
>> >>>> -- 
>> >>>> 2.23.0.rc0
>> >>>>
>> >>>>
>> >>>
>> >>> The intention of this code is to make sure the
>> >>> peer does not violates the MPA handshake rules.
>> >>> The initiator MUST NOT send extra data after its
>> >>> MPA request and before receiving the MPA reply.
>> >>
>> >> I think this is true only for MPA v2. With MPA v1, the
>> >> initiator can begin sending immediately (before receiving
>> >> the MPA reply), because there is no actual negotiation at
>> >> the MPA layer.
>> >>
>> >> With MPA v2, the negotiation exchange is required because
>> >> the type of the following message is predicated by the
>> >> "Enhanced mode" a|b|c|d flags present in the first 32 bits
>> >> of the private data buffer.
>> >>
>> >> So, it seems to me that additional logic is needed here to
>> >> determine the MPA version, before sniffing additional octets
>> >>from the incoming stream?
>> >>
>> >> Tom.
>> >>
>> > 
>> > There still is the marker negotiation taking place.
>> > RFC 5044 says in section 7.1.2:
>> > 
>> > "Note: Since the receiver's ability to deal with Markers is
>> >   unknown until the Request and Reply Frames have been
>> >   received, sending FPDUs before this occurs is not possible."
>> > 
>> > This section further discusses the responder's behavior,
>> > where it MUST receive a first FPDU from the initiator
>> > before sending its first FPDU:
>> > 
>> > "4.  MPA Responder mode implementations MUST receive and validate
>at
>> >         least one FPDU before sending any FPDUs or Markers."
>> > 
>> > So it appears with MPA version 1, the current siw
>> > code is correct. The initiator is entering FPDU mode
>> > first, and only after receiving the MPA reply frame.
>> > Only after the initiator sent a first FPDU, the responder
>> > can start using the connection in FPDU mode.
>> > Because of this somehow broken connection establishment
>> > procedure (only the initiator can start sending data), a
>> > later MPA version makes this first FPDU exchange explicit
>> > and selectable (zero length READ/WRITE/Send).
>> 
>> Yeah, I guess so. Because nobody ever actually implemented markers,
>> I think that they may more or less passively ignore this. But
>you're
>> currect that it's invalid protocol behavior.
>> 
>> If your testing didn't uncover any issues with existing
>implementations
>> failing to connect with your strict checking, I'm ok with it.
>Tom & Bernard,
>Thanks for the insight on MPA negotiation.
>
>Could the below patch be considered as a proper fix?
>
>diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>b/drivers/infiniband/sw/siw/siw_cm.c
>index 9ce8a1b925d2..0aec1b5212f9 100644
>--- a/drivers/infiniband/sw/siw/siw_cm.c
>+++ b/drivers/infiniband/sw/siw/siw_cm.c
>@@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
>        struct socket *s = cep->sock;
>        u16 pd_len;
>        int rcvd, to_rcv;
>+       int extra_data_check = 4; /* 4Bytes, for MPA rules violation
>checking */
>
>        if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
>                rcvd = ksock_recv(s, (char *)hdr +
>cep->mpa.bytes_rcvd,
>@@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
>                return -EPROTO;
>        }
>
>+       /*
>+        * Peer must not send any extra data other than MPA messages
>until MPA
>+        * negotiation is completed, an exception is MPA V2
>client-server Mode,
>+        * IE, in this mode the peer after sending MPA Reply can
>immediately
>+        * start sending data in RDMA mode.
>+        */

This is unfortunately not true. The responder NEVER sends
an FPDU without having seen an FPDU from the initiator.
I just checked RFC 6581 again. The RTR (ready-to-receive)
indication from the initiator is still needed, but now
provided by the protocol and not the application: w/o
'enhanced MPA setup', the initiator sends the first
FPDU as an application message. With 'enhanced MPA setup',
the initiator protocol entity sends (w/o application
interaction) a zero length READ/WRITE/Send as a first FPDU,
as previously negotiated with the responder. Again: only
after the responder has seen the first FPDU, it can
start sending in FPDU mode.

Sorry for the confusion. But the current
siw code appears to be just correct.

Thanks
Bernard



>+       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
>MPA_REVISION_2) &&
>+               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
>+               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
>+                               (MPA_V2_RDMA_WRITE_RTR |
>MPA_V2_RDMA_READ_RTR);
>+               if (!mpa_p2p_mode)
>+                       extra_data_check = 0;
>+       }
>+
>        /*
>         * At this point, we must have hdr->params.pd_len != 0.
>         * A private data buffer gets allocated if hdr->params.pd_len
>!=
>         * 0.
>         */
>        if (!cep->mpa.pdata) {
>-               cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>+               cep->mpa.pdata = kmalloc(pd_len + extra_data_check,
>GFP_KERNEL);
>                if (!cep->mpa.pdata)
>                        return -ENOMEM;
>        }
>        rcvd = ksock_recv(
>                s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>sizeof(struct
>mpa_rr),
>-               to_rcv + 4, MSG_DONTWAIT);
>+               to_rcv + extra_data_check, MSG_DONTWAIT);
>
>        if (rcvd < 0)
>                return rcvd;
>
>-       if (rcvd > to_rcv)
>+       if (extra_data_check && (rcvd > to_rcv))
>                return -EPROTO;
>
>        cep->mpa.bytes_rcvd += rcvd;
>
>-Krishna.
>> 
>> Tom.
>> 
>> 
>> 
>> > 
>> > Bernard.
>> > 
>> >>
>> >>> So, for the MPA request case, this code is needed
>> >>> to check for protocol correctness.
>> >>> You are right for the MPA reply case - if we are
>> >>> _not_ in peer2peer mode, the peer can immediately
>> >>> start sending data in RDMA mode after its MPA Reply.
>> >>> So we shall add appropriate code to be more specific
>> >>> For an error, we are (1) processing an MPA Request,
>> >>> OR (2) processing an MPA Reply AND we are not expected
>> >>> to send an initial READ/WRITE/Send as negotiated with
>> >>> the peer (peer2peer mode MPA handshake).
>> >>>
>> >>> Just removing this check would make siw more permissive,
>> >>> but to a point where peer MPA protocol errors are
>> >>> tolerated. I am not in favor of that level of
>> >>> forgiveness.
>> >>>
>> >>> If possible, please provide an appropriate patch
>> >>> or (if it causes current issues with another peer
>> >>> iWarp implementation) just run in MPA peer2peer mode,
>> >>> where the current check is appropriate.
>> >>> Otherwise, I would provide an appropriate fix by Monday
>> >>> (I am still out of office this week).
>> >>>
>> >>>
>> >>> Many thanks and best regards,
>> >>> Bernard.
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> > 
>> > 
>> > 
>
>
Krishnamraju Eraparaju Aug. 8, 2019, 4:46 p.m. UTC | #9
On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard Metzler wrote:
> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
> 
> >To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
> >From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >Date: 08/05/2019 07:26PM
> >Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
> ><linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
> ><bharat@chelsio.com>, "Nirranjan Kirubaharan" <nirranjan@chelsio.com>
> >Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
> >to read beyond MPA message
> >
> >On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey wrote:
> >> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
> >> > -----"Tom Talpey" <tom@talpey.com> wrote: -----
> >> > 
> >> >> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
> >Eraparaju"
> >> >> <krishna2@chelsio.com>
> >> >> From: "Tom Talpey" <tom@talpey.com>
> >> >> Date: 08/01/2019 08:54PM
> >> >> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
> >bharat@chelsio.com,
> >> >> nirranjan@chelsio.com, krishn2@chelsio.com
> >> >> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
> >tries
> >> >> to read beyond MPA message
> >> >>
> >> >> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
> >> >>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
> >-----
> >> >>>
> >> >>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
> >> >>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >> >>>> Date: 07/31/2019 12:34PM
> >> >>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
> >> >>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
> >> >> Eraparaju"
> >> >>>> <krishna2@chelsio.com>
> >> >>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
> >tries to
> >> >>>> read beyond MPA message
> >> >>>>
> >> >>>> while processing MPA Reply, SIW driver is trying to read extra
> >4
> >> >>>> bytes
> >> >>>> than what peer has advertised as private data length.
> >> >>>>
> >> >>>> If a FPDU data is received before even siw_recv_mpa_rr()
> >completed
> >> >>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
> >could
> >> >> also
> >> >>>> read FPDU, if "size" is larger than advertised MPA reply
> >length.
> >> >>>>
> >> >>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
> >> >>>> 502 {
> >> >>>>            .............
> >> >>>> 572
> >> >>>> 573         if (rcvd > to_rcv)
> >> >>>> 574                 return -EPROTO;   <----- Failure here
> >> >>>>
> >> >>>> Looks like the intention here is to throw an ERROR if the
> >received
> >> >>>> data
> >> >>>> is more than the total private data length advertised by the
> >peer.
> >> >>>> But
> >> >>>> reading beyond MPA message causes siw_cm to generate
> >> >>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer
> >is
> >> >>>> already
> >> >>>> queued with FPDU messages.
> >> >>>>
> >> >>>> Hence, this function should only read upto private data
> >length.
> >> >>>>
> >> >>>> Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
> >> >>>> ---
> >> >>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
> >> >>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >> >>>>
> >> >>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>> b/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>> index a7cde98e73e8..8dc8cea2566c 100644
> >> >>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
> >siw_cep
> >> >> *cep)
> >> >>>> 	 * A private data buffer gets allocated if hdr->params.pd_len
> >!=
> >> >> 0.
> >> >>>> 	 */
> >> >>>> 	if (!cep->mpa.pdata) {
> >> >>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
> >> >>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
> >> >>>> 		if (!cep->mpa.pdata)
> >> >>>> 			return -ENOMEM;
> >> >>>> 	}
> >> >>>> 	rcvd = ksock_recv(
> >> >>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
> >mpa_rr),
> >> >>>> -		to_rcv + 4, MSG_DONTWAIT);
> >> >>>> +		to_rcv, MSG_DONTWAIT);
> >> >>>>
> >> >>>> 	if (rcvd < 0)
> >> >>>> 		return rcvd;
> >> >>>> -- 
> >> >>>> 2.23.0.rc0
> >> >>>>
> >> >>>>
> >> >>>
> >> >>> The intention of this code is to make sure the
> >> >>> peer does not violates the MPA handshake rules.
> >> >>> The initiator MUST NOT send extra data after its
> >> >>> MPA request and before receiving the MPA reply.
> >> >>
> >> >> I think this is true only for MPA v2. With MPA v1, the
> >> >> initiator can begin sending immediately (before receiving
> >> >> the MPA reply), because there is no actual negotiation at
> >> >> the MPA layer.
> >> >>
> >> >> With MPA v2, the negotiation exchange is required because
> >> >> the type of the following message is predicated by the
> >> >> "Enhanced mode" a|b|c|d flags present in the first 32 bits
> >> >> of the private data buffer.
> >> >>
> >> >> So, it seems to me that additional logic is needed here to
> >> >> determine the MPA version, before sniffing additional octets
> >> >>from the incoming stream?
> >> >>
> >> >> Tom.
> >> >>
> >> > 
> >> > There still is the marker negotiation taking place.
> >> > RFC 5044 says in section 7.1.2:
> >> > 
> >> > "Note: Since the receiver's ability to deal with Markers is
> >> >   unknown until the Request and Reply Frames have been
> >> >   received, sending FPDUs before this occurs is not possible."
> >> > 
> >> > This section further discusses the responder's behavior,
> >> > where it MUST receive a first FPDU from the initiator
> >> > before sending its first FPDU:
> >> > 
> >> > "4.  MPA Responder mode implementations MUST receive and validate
> >at
> >> >         least one FPDU before sending any FPDUs or Markers."
> >> > 
> >> > So it appears with MPA version 1, the current siw
> >> > code is correct. The initiator is entering FPDU mode
> >> > first, and only after receiving the MPA reply frame.
> >> > Only after the initiator sent a first FPDU, the responder
> >> > can start using the connection in FPDU mode.
> >> > Because of this somehow broken connection establishment
> >> > procedure (only the initiator can start sending data), a
> >> > later MPA version makes this first FPDU exchange explicit
> >> > and selectable (zero length READ/WRITE/Send).
> >> 
> >> Yeah, I guess so. Because nobody ever actually implemented markers,
> >> I think that they may more or less passively ignore this. But
> >you're
> >> currect that it's invalid protocol behavior.
> >> 
> >> If your testing didn't uncover any issues with existing
> >implementations
> >> failing to connect with your strict checking, I'm ok with it.
> >Tom & Bernard,
> >Thanks for the insight on MPA negotiation.
> >
> >Could the below patch be considered as a proper fix?
> >
> >diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >b/drivers/infiniband/sw/siw/siw_cm.c
> >index 9ce8a1b925d2..0aec1b5212f9 100644
> >--- a/drivers/infiniband/sw/siw/siw_cm.c
> >+++ b/drivers/infiniband/sw/siw/siw_cm.c
> >@@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
> >        struct socket *s = cep->sock;
> >        u16 pd_len;
> >        int rcvd, to_rcv;
> >+       int extra_data_check = 4; /* 4Bytes, for MPA rules violation
> >checking */
> >
> >        if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
> >                rcvd = ksock_recv(s, (char *)hdr +
> >cep->mpa.bytes_rcvd,
> >@@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
> >                return -EPROTO;
> >        }
> >
> >+       /*
> >+        * Peer must not send any extra data other than MPA messages
> >until MPA
> >+        * negotiation is completed, an exception is MPA V2
> >client-server Mode,
> >+        * IE, in this mode the peer after sending MPA Reply can
> >immediately
> >+        * start sending data in RDMA mode.
> >+        */
> 
> This is unfortunately not true. The responder NEVER sends
> an FPDU without having seen an FPDU from the initiator.
> I just checked RFC 6581 again. The RTR (ready-to-receive)
> indication from the initiator is still needed, but now
> provided by the protocol and not the application: w/o
> 'enhanced MPA setup', the initiator sends the first
> FPDU as an application message. With 'enhanced MPA setup',
> the initiator protocol entity sends (w/o application
> interaction) a zero length READ/WRITE/Send as a first FPDU,
> as previously negotiated with the responder. Again: only
> after the responder has seen the first FPDU, it can
> start sending in FPDU mode.
If I understand your statements correctly: in MPA v2 clint-server mode,
the responder application should have some logic to wait(after ESTABLISH
event) until the first FPDU message is received from the initiator?

Thanks,
Krishna.
> 
> Sorry for the confusion. But the current
> siw code appears to be just correct.
> 
> Thanks
> Bernard
> 
> 
> 
> >+       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
> >MPA_REVISION_2) &&
> >+               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
> >+               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
> >+                               (MPA_V2_RDMA_WRITE_RTR |
> >MPA_V2_RDMA_READ_RTR);
> >+               if (!mpa_p2p_mode)
> >+                       extra_data_check = 0;
> >+       }
> >+
> >        /*
> >         * At this point, we must have hdr->params.pd_len != 0.
> >         * A private data buffer gets allocated if hdr->params.pd_len
> >!=
> >         * 0.
> >         */
> >        if (!cep->mpa.pdata) {
> >-               cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
> >+               cep->mpa.pdata = kmalloc(pd_len + extra_data_check,
> >GFP_KERNEL);
> >                if (!cep->mpa.pdata)
> >                        return -ENOMEM;
> >        }
> >        rcvd = ksock_recv(
> >                s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
> >sizeof(struct
> >mpa_rr),
> >-               to_rcv + 4, MSG_DONTWAIT);
> >+               to_rcv + extra_data_check, MSG_DONTWAIT);
> >
> >        if (rcvd < 0)
> >                return rcvd;
> >
> >-       if (rcvd > to_rcv)
> >+       if (extra_data_check && (rcvd > to_rcv))
> >                return -EPROTO;
> >
> >        cep->mpa.bytes_rcvd += rcvd;
> >
> >-Krishna.
> >> 
> >> Tom.
> >> 
> >> 
> >> 
> >> > 
> >> > Bernard.
> >> > 
> >> >>
> >> >>> So, for the MPA request case, this code is needed
> >> >>> to check for protocol correctness.
> >> >>> You are right for the MPA reply case - if we are
> >> >>> _not_ in peer2peer mode, the peer can immediately
> >> >>> start sending data in RDMA mode after its MPA Reply.
> >> >>> So we shall add appropriate code to be more specific
> >> >>> For an error, we are (1) processing an MPA Request,
> >> >>> OR (2) processing an MPA Reply AND we are not expected
> >> >>> to send an initial READ/WRITE/Send as negotiated with
> >> >>> the peer (peer2peer mode MPA handshake).
> >> >>>
> >> >>> Just removing this check would make siw more permissive,
> >> >>> but to a point where peer MPA protocol errors are
> >> >>> tolerated. I am not in favor of that level of
> >> >>> forgiveness.
> >> >>>
> >> >>> If possible, please provide an appropriate patch
> >> >>> or (if it causes current issues with another peer
> >> >>> iWarp implementation) just run in MPA peer2peer mode,
> >> >>> where the current check is appropriate.
> >> >>> Otherwise, I would provide an appropriate fix by Monday
> >> >>> (I am still out of office this week).
> >> >>>
> >> >>>
> >> >>> Many thanks and best regards,
> >> >>> Bernard.
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> > 
> >> > 
> >> > 
> >
> >
>
Bernard Metzler Aug. 9, 2019, 10:29 a.m. UTC | #10
-----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----

>To: "Bernard Metzler" <BMT@zurich.ibm.com>
>From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>Date: 08/08/2019 06:47PM
>Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca" <jgg@ziepe.ca>,
>"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>, "Potnuri
>Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
><nirranjan@chelsio.com>
>Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>to read beyond MPA message
>
>On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard Metzler
>wrote:
>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>> 
>> >To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
>> >From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> >Date: 08/05/2019 07:26PM
>> >Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
>> ><linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>> ><bharat@chelsio.com>, "Nirranjan Kirubaharan"
><nirranjan@chelsio.com>
>> >Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>tries
>> >to read beyond MPA message
>> >
>> >On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey
>wrote:
>> >> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
>> >> > -----"Tom Talpey" <tom@talpey.com> wrote: -----
>> >> > 
>> >> >> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>> >Eraparaju"
>> >> >> <krishna2@chelsio.com>
>> >> >> From: "Tom Talpey" <tom@talpey.com>
>> >> >> Date: 08/01/2019 08:54PM
>> >> >> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
>> >bharat@chelsio.com,
>> >> >> nirranjan@chelsio.com, krishn2@chelsio.com
>> >> >> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>> >tries
>> >> >> to read beyond MPA message
>> >> >>
>> >> >> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>> >> >>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>> >-----
>> >> >>>
>> >> >>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>> >> >>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> >> >>>> Date: 07/31/2019 12:34PM
>> >> >>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>> >> >>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>> >> >> Eraparaju"
>> >> >>>> <krishna2@chelsio.com>
>> >> >>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
>> >tries to
>> >> >>>> read beyond MPA message
>> >> >>>>
>> >> >>>> while processing MPA Reply, SIW driver is trying to read
>extra
>> >4
>> >> >>>> bytes
>> >> >>>> than what peer has advertised as private data length.
>> >> >>>>
>> >> >>>> If a FPDU data is received before even siw_recv_mpa_rr()
>> >completed
>> >> >>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
>> >could
>> >> >> also
>> >> >>>> read FPDU, if "size" is larger than advertised MPA reply
>> >length.
>> >> >>>>
>> >> >>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>> >> >>>> 502 {
>> >> >>>>            .............
>> >> >>>> 572
>> >> >>>> 573         if (rcvd > to_rcv)
>> >> >>>> 574                 return -EPROTO;   <----- Failure here
>> >> >>>>
>> >> >>>> Looks like the intention here is to throw an ERROR if the
>> >received
>> >> >>>> data
>> >> >>>> is more than the total private data length advertised by
>the
>> >peer.
>> >> >>>> But
>> >> >>>> reading beyond MPA message causes siw_cm to generate
>> >> >>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv
>buffer
>> >is
>> >> >>>> already
>> >> >>>> queued with FPDU messages.
>> >> >>>>
>> >> >>>> Hence, this function should only read upto private data
>> >length.
>> >> >>>>
>> >> >>>> Signed-off-by: Krishnamraju Eraparaju
><krishna2@chelsio.com>
>> >> >>>> ---
>> >> >>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>> >> >>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> >> >>>>
>> >> >>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>> >> >>>> b/drivers/infiniband/sw/siw/siw_cm.c
>> >> >>>> index a7cde98e73e8..8dc8cea2566c 100644
>> >> >>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>> >> >>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>> >> >>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
>> >siw_cep
>> >> >> *cep)
>> >> >>>> 	 * A private data buffer gets allocated if
>hdr->params.pd_len
>> >!=
>> >> >> 0.
>> >> >>>> 	 */
>> >> >>>> 	if (!cep->mpa.pdata) {
>> >> >>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>> >> >>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>> >> >>>> 		if (!cep->mpa.pdata)
>> >> >>>> 			return -ENOMEM;
>> >> >>>> 	}
>> >> >>>> 	rcvd = ksock_recv(
>> >> >>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
>> >mpa_rr),
>> >> >>>> -		to_rcv + 4, MSG_DONTWAIT);
>> >> >>>> +		to_rcv, MSG_DONTWAIT);
>> >> >>>>
>> >> >>>> 	if (rcvd < 0)
>> >> >>>> 		return rcvd;
>> >> >>>> -- 
>> >> >>>> 2.23.0.rc0
>> >> >>>>
>> >> >>>>
>> >> >>>
>> >> >>> The intention of this code is to make sure the
>> >> >>> peer does not violates the MPA handshake rules.
>> >> >>> The initiator MUST NOT send extra data after its
>> >> >>> MPA request and before receiving the MPA reply.
>> >> >>
>> >> >> I think this is true only for MPA v2. With MPA v1, the
>> >> >> initiator can begin sending immediately (before receiving
>> >> >> the MPA reply), because there is no actual negotiation at
>> >> >> the MPA layer.
>> >> >>
>> >> >> With MPA v2, the negotiation exchange is required because
>> >> >> the type of the following message is predicated by the
>> >> >> "Enhanced mode" a|b|c|d flags present in the first 32 bits
>> >> >> of the private data buffer.
>> >> >>
>> >> >> So, it seems to me that additional logic is needed here to
>> >> >> determine the MPA version, before sniffing additional octets
>> >> >>from the incoming stream?
>> >> >>
>> >> >> Tom.
>> >> >>
>> >> > 
>> >> > There still is the marker negotiation taking place.
>> >> > RFC 5044 says in section 7.1.2:
>> >> > 
>> >> > "Note: Since the receiver's ability to deal with Markers is
>> >> >   unknown until the Request and Reply Frames have been
>> >> >   received, sending FPDUs before this occurs is not possible."
>> >> > 
>> >> > This section further discusses the responder's behavior,
>> >> > where it MUST receive a first FPDU from the initiator
>> >> > before sending its first FPDU:
>> >> > 
>> >> > "4.  MPA Responder mode implementations MUST receive and
>validate
>> >at
>> >> >         least one FPDU before sending any FPDUs or Markers."
>> >> > 
>> >> > So it appears with MPA version 1, the current siw
>> >> > code is correct. The initiator is entering FPDU mode
>> >> > first, and only after receiving the MPA reply frame.
>> >> > Only after the initiator sent a first FPDU, the responder
>> >> > can start using the connection in FPDU mode.
>> >> > Because of this somehow broken connection establishment
>> >> > procedure (only the initiator can start sending data), a
>> >> > later MPA version makes this first FPDU exchange explicit
>> >> > and selectable (zero length READ/WRITE/Send).
>> >> 
>> >> Yeah, I guess so. Because nobody ever actually implemented
>markers,
>> >> I think that they may more or less passively ignore this. But
>> >you're
>> >> currect that it's invalid protocol behavior.
>> >> 
>> >> If your testing didn't uncover any issues with existing
>> >implementations
>> >> failing to connect with your strict checking, I'm ok with it.
>> >Tom & Bernard,
>> >Thanks for the insight on MPA negotiation.
>> >
>> >Could the below patch be considered as a proper fix?
>> >
>> >diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>> >b/drivers/infiniband/sw/siw/siw_cm.c
>> >index 9ce8a1b925d2..0aec1b5212f9 100644
>> >--- a/drivers/infiniband/sw/siw/siw_cm.c
>> >+++ b/drivers/infiniband/sw/siw/siw_cm.c
>> >@@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep
>*cep)
>> >        struct socket *s = cep->sock;
>> >        u16 pd_len;
>> >        int rcvd, to_rcv;
>> >+       int extra_data_check = 4; /* 4Bytes, for MPA rules
>violation
>> >checking */
>> >
>> >        if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
>> >                rcvd = ksock_recv(s, (char *)hdr +
>> >cep->mpa.bytes_rcvd,
>> >@@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep
>*cep)
>> >                return -EPROTO;
>> >        }
>> >
>> >+       /*
>> >+        * Peer must not send any extra data other than MPA
>messages
>> >until MPA
>> >+        * negotiation is completed, an exception is MPA V2
>> >client-server Mode,
>> >+        * IE, in this mode the peer after sending MPA Reply can
>> >immediately
>> >+        * start sending data in RDMA mode.
>> >+        */
>> 
>> This is unfortunately not true. The responder NEVER sends
>> an FPDU without having seen an FPDU from the initiator.
>> I just checked RFC 6581 again. The RTR (ready-to-receive)
>> indication from the initiator is still needed, but now
>> provided by the protocol and not the application: w/o
>> 'enhanced MPA setup', the initiator sends the first
>> FPDU as an application message. With 'enhanced MPA setup',
>> the initiator protocol entity sends (w/o application
>> interaction) a zero length READ/WRITE/Send as a first FPDU,
>> as previously negotiated with the responder. Again: only
>> after the responder has seen the first FPDU, it can
>> start sending in FPDU mode.
>If I understand your statements correctly: in MPA v2 clint-server
>mode,
>the responder application should have some logic to wait(after
>ESTABLISH
>event) until the first FPDU message is received from the initiator?
>
>Thanks,
>Krishna.

The responder may either delay the established event
until it sees the first peer FPDU, or it delays sending
the first (now already posted) FPDU, until it sees the
first initiator FPDU. This, since (theoretically), the
responder does not know yet the result of CRC and
Marker negotiation. siw delays the established event,
which seems the more clear path to me.

This MPA handshake is cumbersome. siw is permissive
for violations of these rules to the extend, that the
responder shall not send back to back the first FPDU
with the MPA reply. If a (Chelsio ?) iWarp adapter
would send the first responder FPDU only after it
got TCP acked its MPA reply, the chances are high the
(siw) iWarp initiator has already transitioned to RTS
and it would accept that FPU, even if not having sent
it's first FPDU.

siw sticks to this rule to that extend, since all receive
processing is triggered by socket callbacks (from
sk->sk_data_ready()). As long as the QP is not
in RTS, all TCP bytestream processing is done by the
MPA protocol (code in siw_cm.c). If the QP reaches
RTS, RX processing is done bt the full iWarp
protocol (code in siw_qp_rx.c for RX).
Now, if the higher layer (application) protocol has
a semantic where the responder sends the first message,
we could end up in an infinite wait for that packet
at initiator application side.
This, since the complete first FPDU was already
received at the TCP socket, while the QP was not in
RTS. It will not trigger any additional sk->sk_data_ready()
and we are stuck with a FPDU in the socket rx buffer.

I implemented that 4 bytes extra data testing only to
keep the siw protocol state machine as simple as
possible (it is already unfortunately complex, if
you count the lines in siw_cm.c), while following
the protocol rules.

I suggest to correctly implement peer2peer mode and
delay the established event at the responder side until
it got the negotiated first zero length FPDU. Out
of the possible options, siw supports both zero length
WRITE and READ, but no SEND, since this would consume
an extra receive WQE.

As a last resort, I might consider extending the
siw state machine to handle those - non-complaint -
cases gracefully. But, that would result in different
code than just avoiding checking for peer protocol
violation.

Thanks,
Bernard.


>> 
>> Sorry for the confusion. But the current
>> siw code appears to be just correct.
>> 
>> Thanks
>> Bernard
>> 
>> 
>> 
>> >+       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
>> >MPA_REVISION_2) &&
>> >+               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
>> >+               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
>> >+                               (MPA_V2_RDMA_WRITE_RTR |
>> >MPA_V2_RDMA_READ_RTR);
>> >+               if (!mpa_p2p_mode)
>> >+                       extra_data_check = 0;
>> >+       }
>> >+
>> >        /*
>> >         * At this point, we must have hdr->params.pd_len != 0.
>> >         * A private data buffer gets allocated if
>hdr->params.pd_len
>> >!=
>> >         * 0.
>> >         */
>> >        if (!cep->mpa.pdata) {
>> >-               cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>> >+               cep->mpa.pdata = kmalloc(pd_len +
>extra_data_check,
>> >GFP_KERNEL);
>> >                if (!cep->mpa.pdata)
>> >                        return -ENOMEM;
>> >        }
>> >        rcvd = ksock_recv(
>> >                s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>> >sizeof(struct
>> >mpa_rr),
>> >-               to_rcv + 4, MSG_DONTWAIT);
>> >+               to_rcv + extra_data_check, MSG_DONTWAIT);
>> >
>> >        if (rcvd < 0)
>> >                return rcvd;
>> >
>> >-       if (rcvd > to_rcv)
>> >+       if (extra_data_check && (rcvd > to_rcv))
>> >                return -EPROTO;
>> >
>> >        cep->mpa.bytes_rcvd += rcvd;
>> >
>> >-Krishna.
>> >> 
>> >> Tom.
>> >> 
>> >> 
>> >> 
>> >> > 
>> >> > Bernard.
>> >> > 
>> >> >>
>> >> >>> So, for the MPA request case, this code is needed
>> >> >>> to check for protocol correctness.
>> >> >>> You are right for the MPA reply case - if we are
>> >> >>> _not_ in peer2peer mode, the peer can immediately
>> >> >>> start sending data in RDMA mode after its MPA Reply.
>> >> >>> So we shall add appropriate code to be more specific
>> >> >>> For an error, we are (1) processing an MPA Request,
>> >> >>> OR (2) processing an MPA Reply AND we are not expected
>> >> >>> to send an initial READ/WRITE/Send as negotiated with
>> >> >>> the peer (peer2peer mode MPA handshake).
>> >> >>>
>> >> >>> Just removing this check would make siw more permissive,
>> >> >>> but to a point where peer MPA protocol errors are
>> >> >>> tolerated. I am not in favor of that level of
>> >> >>> forgiveness.
>> >> >>>
>> >> >>> If possible, please provide an appropriate patch
>> >> >>> or (if it causes current issues with another peer
>> >> >>> iWarp implementation) just run in MPA peer2peer mode,
>> >> >>> where the current check is appropriate.
>> >> >>> Otherwise, I would provide an appropriate fix by Monday
>> >> >>> (I am still out of office this week).
>> >> >>>
>> >> >>>
>> >> >>> Many thanks and best regards,
>> >> >>> Bernard.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> > 
>> >> > 
>> >> > 
>> >
>> >
>> 
>
>
Tom Talpey Aug. 9, 2019, 12:27 p.m. UTC | #11
On 8/9/2019 6:29 AM, Bernard Metzler wrote:
> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
> 
>> To: "Bernard Metzler" <BMT@zurich.ibm.com>
>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> Date: 08/08/2019 06:47PM
>> Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca" <jgg@ziepe.ca>,
>> "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>, "Potnuri
>> Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>> <nirranjan@chelsio.com>
>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>> to read beyond MPA message
>>
>> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard Metzler
>> wrote:
>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>>>
>>>> To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>> Date: 08/05/2019 07:26PM
>>>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
>>>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>>>> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>> <nirranjan@chelsio.com>
>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>> tries
>>>> to read beyond MPA message
>>>>
>>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey
>> wrote:
>>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
>>>>>> -----"Tom Talpey" <tom@talpey.com> wrote: -----
>>>>>>
>>>>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>>>> Eraparaju"
>>>>>>> <krishna2@chelsio.com>
>>>>>>> From: "Tom Talpey" <tom@talpey.com>
>>>>>>> Date: 08/01/2019 08:54PM
>>>>>>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
>>>> bharat@chelsio.com,
>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com
>>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>>>> tries
>>>>>>> to read beyond MPA message
>>>>>>>
>>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>>>> -----
>>>>>>>>
>>>>>>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>>>>>>> Date: 07/31/2019 12:34PM
>>>>>>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>>>>>>> Eraparaju"
>>>>>>>>> <krishna2@chelsio.com>
>>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
>>>> tries to
>>>>>>>>> read beyond MPA message
>>>>>>>>>
>>>>>>>>> while processing MPA Reply, SIW driver is trying to read
>> extra
>>>> 4
>>>>>>>>> bytes
>>>>>>>>> than what peer has advertised as private data length.
>>>>>>>>>
>>>>>>>>> If a FPDU data is received before even siw_recv_mpa_rr()
>>>> completed
>>>>>>>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
>>>> could
>>>>>>> also
>>>>>>>>> read FPDU, if "size" is larger than advertised MPA reply
>>>> length.
>>>>>>>>>
>>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>>>>>>>>> 502 {
>>>>>>>>>             .............
>>>>>>>>> 572
>>>>>>>>> 573         if (rcvd > to_rcv)
>>>>>>>>> 574                 return -EPROTO;   <----- Failure here
>>>>>>>>>
>>>>>>>>> Looks like the intention here is to throw an ERROR if the
>>>> received
>>>>>>>>> data
>>>>>>>>> is more than the total private data length advertised by
>> the
>>>> peer.
>>>>>>>>> But
>>>>>>>>> reading beyond MPA message causes siw_cm to generate
>>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv
>> buffer
>>>> is
>>>>>>>>> already
>>>>>>>>> queued with FPDU messages.
>>>>>>>>>
>>>>>>>>> Hence, this function should only read upto private data
>>>> length.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Krishnamraju Eraparaju
>> <krishna2@chelsio.com>
>>>>>>>>> ---
>>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644
>>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
>>>> siw_cep
>>>>>>> *cep)
>>>>>>>>> 	 * A private data buffer gets allocated if
>> hdr->params.pd_len
>>>> !=
>>>>>>> 0.
>>>>>>>>> 	 */
>>>>>>>>> 	if (!cep->mpa.pdata) {
>>>>>>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>>>>>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>>>>>>>>> 		if (!cep->mpa.pdata)
>>>>>>>>> 			return -ENOMEM;
>>>>>>>>> 	}
>>>>>>>>> 	rcvd = ksock_recv(
>>>>>>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
>>>> mpa_rr),
>>>>>>>>> -		to_rcv + 4, MSG_DONTWAIT);
>>>>>>>>> +		to_rcv, MSG_DONTWAIT);
>>>>>>>>>
>>>>>>>>> 	if (rcvd < 0)
>>>>>>>>> 		return rcvd;
>>>>>>>>> -- 
>>>>>>>>> 2.23.0.rc0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> The intention of this code is to make sure the
>>>>>>>> peer does not violates the MPA handshake rules.
>>>>>>>> The initiator MUST NOT send extra data after its
>>>>>>>> MPA request and before receiving the MPA reply.
>>>>>>>
>>>>>>> I think this is true only for MPA v2. With MPA v1, the
>>>>>>> initiator can begin sending immediately (before receiving
>>>>>>> the MPA reply), because there is no actual negotiation at
>>>>>>> the MPA layer.
>>>>>>>
>>>>>>> With MPA v2, the negotiation exchange is required because
>>>>>>> the type of the following message is predicated by the
>>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32 bits
>>>>>>> of the private data buffer.
>>>>>>>
>>>>>>> So, it seems to me that additional logic is needed here to
>>>>>>> determine the MPA version, before sniffing additional octets
>>>>>> >from the incoming stream?
>>>>>>>
>>>>>>> Tom.
>>>>>>>
>>>>>>
>>>>>> There still is the marker negotiation taking place.
>>>>>> RFC 5044 says in section 7.1.2:
>>>>>>
>>>>>> "Note: Since the receiver's ability to deal with Markers is
>>>>>>    unknown until the Request and Reply Frames have been
>>>>>>    received, sending FPDUs before this occurs is not possible."
>>>>>>
>>>>>> This section further discusses the responder's behavior,
>>>>>> where it MUST receive a first FPDU from the initiator
>>>>>> before sending its first FPDU:
>>>>>>
>>>>>> "4.  MPA Responder mode implementations MUST receive and
>> validate
>>>> at
>>>>>>          least one FPDU before sending any FPDUs or Markers."
>>>>>>
>>>>>> So it appears with MPA version 1, the current siw
>>>>>> code is correct. The initiator is entering FPDU mode
>>>>>> first, and only after receiving the MPA reply frame.
>>>>>> Only after the initiator sent a first FPDU, the responder
>>>>>> can start using the connection in FPDU mode.
>>>>>> Because of this somehow broken connection establishment
>>>>>> procedure (only the initiator can start sending data), a
>>>>>> later MPA version makes this first FPDU exchange explicit
>>>>>> and selectable (zero length READ/WRITE/Send).
>>>>>
>>>>> Yeah, I guess so. Because nobody ever actually implemented
>> markers,
>>>>> I think that they may more or less passively ignore this. But
>>>> you're
>>>>> currect that it's invalid protocol behavior.
>>>>>
>>>>> If your testing didn't uncover any issues with existing
>>>> implementations
>>>>> failing to connect with your strict checking, I'm ok with it.
>>>> Tom & Bernard,
>>>> Thanks for the insight on MPA negotiation.
>>>>
>>>> Could the below patch be considered as a proper fix?
>>>>
>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>> index 9ce8a1b925d2..0aec1b5212f9 100644
>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep
>> *cep)
>>>>         struct socket *s = cep->sock;
>>>>         u16 pd_len;
>>>>         int rcvd, to_rcv;
>>>> +       int extra_data_check = 4; /* 4Bytes, for MPA rules
>> violation
>>>> checking */
>>>>
>>>>         if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
>>>>                 rcvd = ksock_recv(s, (char *)hdr +
>>>> cep->mpa.bytes_rcvd,
>>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep
>> *cep)
>>>>                 return -EPROTO;
>>>>         }
>>>>
>>>> +       /*
>>>> +        * Peer must not send any extra data other than MPA
>> messages
>>>> until MPA
>>>> +        * negotiation is completed, an exception is MPA V2
>>>> client-server Mode,
>>>> +        * IE, in this mode the peer after sending MPA Reply can
>>>> immediately
>>>> +        * start sending data in RDMA mode.
>>>> +        */
>>>
>>> This is unfortunately not true. The responder NEVER sends
>>> an FPDU without having seen an FPDU from the initiator.
>>> I just checked RFC 6581 again. The RTR (ready-to-receive)
>>> indication from the initiator is still needed, but now
>>> provided by the protocol and not the application: w/o
>>> 'enhanced MPA setup', the initiator sends the first
>>> FPDU as an application message. With 'enhanced MPA setup',
>>> the initiator protocol entity sends (w/o application
>>> interaction) a zero length READ/WRITE/Send as a first FPDU,
>>> as previously negotiated with the responder. Again: only
>>> after the responder has seen the first FPDU, it can
>>> start sending in FPDU mode.
>> If I understand your statements correctly: in MPA v2 clint-server
>> mode,
>> the responder application should have some logic to wait(after
>> ESTABLISH
>> event) until the first FPDU message is received from the initiator?
>>
>> Thanks,
>> Krishna.
> 
> The responder may either delay the established event
> until it sees the first peer FPDU, or it delays sending
> the first (now already posted) FPDU, until it sees the
> first initiator FPDU. This, since (theoretically), the
> responder does not know yet the result of CRC and
> Marker negotiation. siw delays the established event,
> which seems the more clear path to me.
> 
> This MPA handshake is cumbersome. siw is permissive
> for violations of these rules to the extend, that the
> responder shall not send back to back the first FPDU
> with the MPA reply. If a (Chelsio ?) iWarp adapter
> would send the first responder FPDU only after it
> got TCP acked its MPA reply, the chances are high the
> (siw) iWarp initiator has already transitioned to RTS
> and it would accept that FPU, even if not having sent
> it's first FPDU.
> 
> siw sticks to this rule to that extend, since all receive
> processing is triggered by socket callbacks (from
> sk->sk_data_ready()). As long as the QP is not
> in RTS, all TCP bytestream processing is done by the
> MPA protocol (code in siw_cm.c). If the QP reaches
> RTS, RX processing is done bt the full iWarp
> protocol (code in siw_qp_rx.c for RX).
> Now, if the higher layer (application) protocol has
> a semantic where the responder sends the first message,
> we could end up in an infinite wait for that packet
> at initiator application side.
> This, since the complete first FPDU was already
> received at the TCP socket, while the QP was not in
> RTS. It will not trigger any additional sk->sk_data_ready()
> and we are stuck with a FPDU in the socket rx buffer.
> 
> I implemented that 4 bytes extra data testing only to
> keep the siw protocol state machine as simple as
> possible (it is already unfortunately complex, if
> you count the lines in siw_cm.c), while following
> the protocol rules.
> 
> I suggest to correctly implement peer2peer mode and
> delay the established event at the responder side until
> it got the negotiated first zero length FPDU. Out
> of the possible options, siw supports both zero length
> WRITE and READ, but no SEND, since this would consume
> an extra receive WQE.
> 
> As a last resort, I might consider extending the
> siw state machine to handle those - non-complaint -
> cases gracefully. But, that would result in different
> code than just avoiding checking for peer protocol
> violation.

Bernard, everything you describe in siw seems perfectly valid
to me, and in fact that's why the MPA enhanced connection mode
was designed the way it is. I disagree that it's "cumbersome",
but that's a nit.

The issue is that the responder reaches RTS at a different time
than the initiator reaches RTR. The original iWARP connection
model did not make any requirement, and races were possible.
MPAv2 fixes those.

SIW is certainly within its rights to prevent peers from
sending prior to responder RTR. I'd suggest that if possible,
you try to detect the deadlock rather than flatly rejecting any
incoming bytes. The Internet Principle (be liberal in what you
accept...), after all.

This same dance happens in IB/RoCE, btw. Just via different
messages.

Tom.


> 
> Thanks,
> Bernard.
> 
> 
>>>
>>> Sorry for the confusion. But the current
>>> siw code appears to be just correct.
>>>
>>> Thanks
>>> Bernard
>>>
>>>
>>>
>>>> +       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
>>>> MPA_REVISION_2) &&
>>>> +               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
>>>> +               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
>>>> +                               (MPA_V2_RDMA_WRITE_RTR |
>>>> MPA_V2_RDMA_READ_RTR);
>>>> +               if (!mpa_p2p_mode)
>>>> +                       extra_data_check = 0;
>>>> +       }
>>>> +
>>>>         /*
>>>>          * At this point, we must have hdr->params.pd_len != 0.
>>>>          * A private data buffer gets allocated if
>> hdr->params.pd_len
>>>> !=
>>>>          * 0.
>>>>          */
>>>>         if (!cep->mpa.pdata) {
>>>> -               cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>>> +               cep->mpa.pdata = kmalloc(pd_len +
>> extra_data_check,
>>>> GFP_KERNEL);
>>>>                 if (!cep->mpa.pdata)
>>>>                         return -ENOMEM;
>>>>         }
>>>>         rcvd = ksock_recv(
>>>>                 s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>>>> sizeof(struct
>>>> mpa_rr),
>>>> -               to_rcv + 4, MSG_DONTWAIT);
>>>> +               to_rcv + extra_data_check, MSG_DONTWAIT);
>>>>
>>>>         if (rcvd < 0)
>>>>                 return rcvd;
>>>>
>>>> -       if (rcvd > to_rcv)
>>>> +       if (extra_data_check && (rcvd > to_rcv))
>>>>                 return -EPROTO;
>>>>
>>>>         cep->mpa.bytes_rcvd += rcvd;
>>>>
>>>> -Krishna.
>>>>>
>>>>> Tom.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Bernard.
>>>>>>
>>>>>>>
>>>>>>>> So, for the MPA request case, this code is needed
>>>>>>>> to check for protocol correctness.
>>>>>>>> You are right for the MPA reply case - if we are
>>>>>>>> _not_ in peer2peer mode, the peer can immediately
>>>>>>>> start sending data in RDMA mode after its MPA Reply.
>>>>>>>> So we shall add appropriate code to be more specific
>>>>>>>> For an error, we are (1) processing an MPA Request,
>>>>>>>> OR (2) processing an MPA Reply AND we are not expected
>>>>>>>> to send an initial READ/WRITE/Send as negotiated with
>>>>>>>> the peer (peer2peer mode MPA handshake).
>>>>>>>>
>>>>>>>> Just removing this check would make siw more permissive,
>>>>>>>> but to a point where peer MPA protocol errors are
>>>>>>>> tolerated. I am not in favor of that level of
>>>>>>>> forgiveness.
>>>>>>>>
>>>>>>>> If possible, please provide an appropriate patch
>>>>>>>> or (if it causes current issues with another peer
>>>>>>>> iWarp implementation) just run in MPA peer2peer mode,
>>>>>>>> where the current check is appropriate.
>>>>>>>> Otherwise, I would provide an appropriate fix by Monday
>>>>>>>> (I am still out of office this week).
>>>>>>>>
>>>>>>>>
>>>>>>>> Many thanks and best regards,
>>>>>>>> Bernard.
>>>>>>>>
Tom Talpey Aug. 9, 2019, 12:32 p.m. UTC | #12
On 8/8/2019 12:46 PM, Krishnamraju Eraparaju wrote:
> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard Metzler wrote:
>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>>
>>> To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>> Date: 08/05/2019 07:26PM
>>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
>>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>>> <bharat@chelsio.com>, "Nirranjan Kirubaharan" <nirranjan@chelsio.com>
>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>>> to read beyond MPA message
>>>
>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey wrote:
>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
>>>>> -----"Tom Talpey" <tom@talpey.com> wrote: -----
>>>>>
>>>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>>> Eraparaju"
>>>>>> <krishna2@chelsio.com>
>>>>>> From: "Tom Talpey" <tom@talpey.com>
>>>>>> Date: 08/01/2019 08:54PM
>>>>>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
>>> bharat@chelsio.com,
>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com
>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>>> tries
>>>>>> to read beyond MPA message
>>>>>>
>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>>> -----
>>>>>>>
>>>>>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>>>>>> Date: 07/31/2019 12:34PM
>>>>>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>>>>>> Eraparaju"
>>>>>>>> <krishna2@chelsio.com>
>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
>>> tries to
>>>>>>>> read beyond MPA message
>>>>>>>>
>>>>>>>> while processing MPA Reply, SIW driver is trying to read extra
>>> 4
>>>>>>>> bytes
>>>>>>>> than what peer has advertised as private data length.
>>>>>>>>
>>>>>>>> If a FPDU data is received before even siw_recv_mpa_rr()
>>> completed
>>>>>>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
>>> could
>>>>>> also
>>>>>>>> read FPDU, if "size" is larger than advertised MPA reply
>>> length.
>>>>>>>>
>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>>>>>>>> 502 {
>>>>>>>>             .............
>>>>>>>> 572
>>>>>>>> 573         if (rcvd > to_rcv)
>>>>>>>> 574                 return -EPROTO;   <----- Failure here
>>>>>>>>
>>>>>>>> Looks like the intention here is to throw an ERROR if the
>>> received
>>>>>>>> data
>>>>>>>> is more than the total private data length advertised by the
>>> peer.
>>>>>>>> But
>>>>>>>> reading beyond MPA message causes siw_cm to generate
>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv buffer
>>> is
>>>>>>>> already
>>>>>>>> queued with FPDU messages.
>>>>>>>>
>>>>>>>> Hence, this function should only read upto private data
>>> length.
>>>>>>>>
>>>>>>>> Signed-off-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
>>>>>>>> ---
>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644
>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
>>> siw_cep
>>>>>> *cep)
>>>>>>>> 	 * A private data buffer gets allocated if hdr->params.pd_len
>>> !=
>>>>>> 0.
>>>>>>>> 	 */
>>>>>>>> 	if (!cep->mpa.pdata) {
>>>>>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>>>>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>>>>>>>> 		if (!cep->mpa.pdata)
>>>>>>>> 			return -ENOMEM;
>>>>>>>> 	}
>>>>>>>> 	rcvd = ksock_recv(
>>>>>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
>>> mpa_rr),
>>>>>>>> -		to_rcv + 4, MSG_DONTWAIT);
>>>>>>>> +		to_rcv, MSG_DONTWAIT);
>>>>>>>>
>>>>>>>> 	if (rcvd < 0)
>>>>>>>> 		return rcvd;
>>>>>>>> -- 
>>>>>>>> 2.23.0.rc0
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> The intention of this code is to make sure the
>>>>>>> peer does not violates the MPA handshake rules.
>>>>>>> The initiator MUST NOT send extra data after its
>>>>>>> MPA request and before receiving the MPA reply.
>>>>>>
>>>>>> I think this is true only for MPA v2. With MPA v1, the
>>>>>> initiator can begin sending immediately (before receiving
>>>>>> the MPA reply), because there is no actual negotiation at
>>>>>> the MPA layer.
>>>>>>
>>>>>> With MPA v2, the negotiation exchange is required because
>>>>>> the type of the following message is predicated by the
>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32 bits
>>>>>> of the private data buffer.
>>>>>>
>>>>>> So, it seems to me that additional logic is needed here to
>>>>>> determine the MPA version, before sniffing additional octets
>>>>> >from the incoming stream?
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>
>>>>> There still is the marker negotiation taking place.
>>>>> RFC 5044 says in section 7.1.2:
>>>>>
>>>>> "Note: Since the receiver's ability to deal with Markers is
>>>>>    unknown until the Request and Reply Frames have been
>>>>>    received, sending FPDUs before this occurs is not possible."
>>>>>
>>>>> This section further discusses the responder's behavior,
>>>>> where it MUST receive a first FPDU from the initiator
>>>>> before sending its first FPDU:
>>>>>
>>>>> "4.  MPA Responder mode implementations MUST receive and validate
>>> at
>>>>>          least one FPDU before sending any FPDUs or Markers."
>>>>>
>>>>> So it appears with MPA version 1, the current siw
>>>>> code is correct. The initiator is entering FPDU mode
>>>>> first, and only after receiving the MPA reply frame.
>>>>> Only after the initiator sent a first FPDU, the responder
>>>>> can start using the connection in FPDU mode.
>>>>> Because of this somehow broken connection establishment
>>>>> procedure (only the initiator can start sending data), a
>>>>> later MPA version makes this first FPDU exchange explicit
>>>>> and selectable (zero length READ/WRITE/Send).
>>>>
>>>> Yeah, I guess so. Because nobody ever actually implemented markers,
>>>> I think that they may more or less passively ignore this. But
>>> you're
>>>> currect that it's invalid protocol behavior.
>>>>
>>>> If your testing didn't uncover any issues with existing
>>> implementations
>>>> failing to connect with your strict checking, I'm ok with it.
>>> Tom & Bernard,
>>> Thanks for the insight on MPA negotiation.
>>>
>>> Could the below patch be considered as a proper fix?
>>>
>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>> index 9ce8a1b925d2..0aec1b5212f9 100644
>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
>>>         struct socket *s = cep->sock;
>>>         u16 pd_len;
>>>         int rcvd, to_rcv;
>>> +       int extra_data_check = 4; /* 4Bytes, for MPA rules violation
>>> checking */
>>>
>>>         if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
>>>                 rcvd = ksock_recv(s, (char *)hdr +
>>> cep->mpa.bytes_rcvd,
>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep *cep)
>>>                 return -EPROTO;
>>>         }
>>>
>>> +       /*
>>> +        * Peer must not send any extra data other than MPA messages
>>> until MPA
>>> +        * negotiation is completed, an exception is MPA V2
>>> client-server Mode,
>>> +        * IE, in this mode the peer after sending MPA Reply can
>>> immediately
>>> +        * start sending data in RDMA mode.
>>> +        */
>>
>> This is unfortunately not true. The responder NEVER sends
>> an FPDU without having seen an FPDU from the initiator.
>> I just checked RFC 6581 again. The RTR (ready-to-receive)
>> indication from the initiator is still needed, but now
>> provided by the protocol and not the application: w/o
>> 'enhanced MPA setup', the initiator sends the first
>> FPDU as an application message. With 'enhanced MPA setup',
>> the initiator protocol entity sends (w/o application
>> interaction) a zero length READ/WRITE/Send as a first FPDU,
>> as previously negotiated with the responder. Again: only
>> after the responder has seen the first FPDU, it can
>> start sending in FPDU mode.
> If I understand your statements correctly: in MPA v2 clint-server mode,
> the responder application should have some logic to wait(after ESTABLISH
> event) until the first FPDU message is received from the initiator?

Krishna, this is not the application's responsibility. This is MPA,
and the iWARP provider's domain.

Tom.

> 
> Thanks,
> Krishna.
>>
>> Sorry for the confusion. But the current
>> siw code appears to be just correct.
>>
>> Thanks
>> Bernard
>>
>>
>>
>>> +       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
>>> MPA_REVISION_2) &&
>>> +               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
>>> +               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
>>> +                               (MPA_V2_RDMA_WRITE_RTR |
>>> MPA_V2_RDMA_READ_RTR);
>>> +               if (!mpa_p2p_mode)
>>> +                       extra_data_check = 0;
>>> +       }
>>> +
>>>         /*
>>>          * At this point, we must have hdr->params.pd_len != 0.
>>>          * A private data buffer gets allocated if hdr->params.pd_len
>>> !=
>>>          * 0.
>>>          */
>>>         if (!cep->mpa.pdata) {
>>> -               cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>> +               cep->mpa.pdata = kmalloc(pd_len + extra_data_check,
>>> GFP_KERNEL);
>>>                 if (!cep->mpa.pdata)
>>>                         return -ENOMEM;
>>>         }
>>>         rcvd = ksock_recv(
>>>                 s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>>> sizeof(struct
>>> mpa_rr),
>>> -               to_rcv + 4, MSG_DONTWAIT);
>>> +               to_rcv + extra_data_check, MSG_DONTWAIT);
>>>
>>>         if (rcvd < 0)
>>>                 return rcvd;
>>>
>>> -       if (rcvd > to_rcv)
>>> +       if (extra_data_check && (rcvd > to_rcv))
>>>                 return -EPROTO;
>>>
>>>         cep->mpa.bytes_rcvd += rcvd;
>>>
>>> -Krishna.
>>>>
>>>> Tom.
>>>>
>>>>
>>>>
>>>>>
>>>>> Bernard.
>>>>>
>>>>>>
>>>>>>> So, for the MPA request case, this code is needed
>>>>>>> to check for protocol correctness.
>>>>>>> You are right for the MPA reply case - if we are
>>>>>>> _not_ in peer2peer mode, the peer can immediately
>>>>>>> start sending data in RDMA mode after its MPA Reply.
>>>>>>> So we shall add appropriate code to be more specific
>>>>>>> For an error, we are (1) processing an MPA Request,
>>>>>>> OR (2) processing an MPA Reply AND we are not expected
>>>>>>> to send an initial READ/WRITE/Send as negotiated with
>>>>>>> the peer (peer2peer mode MPA handshake).
>>>>>>>
>>>>>>> Just removing this check would make siw more permissive,
>>>>>>> but to a point where peer MPA protocol errors are
>>>>>>> tolerated. I am not in favor of that level of
>>>>>>> forgiveness.
>>>>>>>
>>>>>>> If possible, please provide an appropriate patch
>>>>>>> or (if it causes current issues with another peer
>>>>>>> iWarp implementation) just run in MPA peer2peer mode,
>>>>>>> where the current check is appropriate.
>>>>>>> Otherwise, I would provide an appropriate fix by Monday
>>>>>>> (I am still out of office this week).
>>>>>>>
>>>>>>>
>>>>>>> Many thanks and best regards,
>>>>>>> Bernard.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>
> 
>
Bernard Metzler Aug. 9, 2019, 1:52 p.m. UTC | #13
-----"Tom Talpey" <tom@talpey.com> wrote: -----

>To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju Eraparaju"
><krishna2@chelsio.com>
>From: "Tom Talpey" <tom@talpey.com>
>Date: 08/09/2019 02:27PM
>Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
><linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
><bharat@chelsio.com>, "Nirranjan Kirubaharan" <nirranjan@chelsio.com>
>Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>to read beyond MPA message
>
>On 8/9/2019 6:29 AM, Bernard Metzler wrote:
>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>> 
>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>
>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>> Date: 08/08/2019 06:47PM
>>> Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca" <jgg@ziepe.ca>,
>>> "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
>"Potnuri
>>> Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>>> <nirranjan@chelsio.com>
>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>tries
>>> to read beyond MPA message
>>>
>>> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard
>Metzler
>>> wrote:
>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>>>>
>>>>> To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>>> Date: 08/05/2019 07:26PM
>>>>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
>>>>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>>>>> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>>> <nirranjan@chelsio.com>
>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>>> tries
>>>>> to read beyond MPA message
>>>>>
>>>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey
>>> wrote:
>>>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
>>>>>>> -----"Tom Talpey" <tom@talpey.com> wrote: -----
>>>>>>>
>>>>>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>>>>> Eraparaju"
>>>>>>>> <krishna2@chelsio.com>
>>>>>>>> From: "Tom Talpey" <tom@talpey.com>
>>>>>>>> Date: 08/01/2019 08:54PM
>>>>>>>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
>>>>> bharat@chelsio.com,
>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com
>>>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>>>>> tries
>>>>>>>> to read beyond MPA message
>>>>>>>>
>>>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>>>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>>>>> -----
>>>>>>>>>
>>>>>>>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>>>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>>>>>>>> Date: 07/31/2019 12:34PM
>>>>>>>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>>>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>>>>>>>> Eraparaju"
>>>>>>>>>> <krishna2@chelsio.com>
>>>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
>>>>> tries to
>>>>>>>>>> read beyond MPA message
>>>>>>>>>>
>>>>>>>>>> while processing MPA Reply, SIW driver is trying to read
>>> extra
>>>>> 4
>>>>>>>>>> bytes
>>>>>>>>>> than what peer has advertised as private data length.
>>>>>>>>>>
>>>>>>>>>> If a FPDU data is received before even siw_recv_mpa_rr()
>>>>> completed
>>>>>>>>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
>>>>> could
>>>>>>>> also
>>>>>>>>>> read FPDU, if "size" is larger than advertised MPA reply
>>>>> length.
>>>>>>>>>>
>>>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>>>>>>>>>> 502 {
>>>>>>>>>>             .............
>>>>>>>>>> 572
>>>>>>>>>> 573         if (rcvd > to_rcv)
>>>>>>>>>> 574                 return -EPROTO;   <----- Failure here
>>>>>>>>>>
>>>>>>>>>> Looks like the intention here is to throw an ERROR if the
>>>>> received
>>>>>>>>>> data
>>>>>>>>>> is more than the total private data length advertised by
>>> the
>>>>> peer.
>>>>>>>>>> But
>>>>>>>>>> reading beyond MPA message causes siw_cm to generate
>>>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv
>>> buffer
>>>>> is
>>>>>>>>>> already
>>>>>>>>>> queued with FPDU messages.
>>>>>>>>>>
>>>>>>>>>> Hence, this function should only read upto private data
>>>>> length.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Krishnamraju Eraparaju
>>> <krishna2@chelsio.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644
>>>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
>>>>> siw_cep
>>>>>>>> *cep)
>>>>>>>>>> 	 * A private data buffer gets allocated if
>>> hdr->params.pd_len
>>>>> !=
>>>>>>>> 0.
>>>>>>>>>> 	 */
>>>>>>>>>> 	if (!cep->mpa.pdata) {
>>>>>>>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>>>>>>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>>>>>>>>>> 		if (!cep->mpa.pdata)
>>>>>>>>>> 			return -ENOMEM;
>>>>>>>>>> 	}
>>>>>>>>>> 	rcvd = ksock_recv(
>>>>>>>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
>>>>> mpa_rr),
>>>>>>>>>> -		to_rcv + 4, MSG_DONTWAIT);
>>>>>>>>>> +		to_rcv, MSG_DONTWAIT);
>>>>>>>>>>
>>>>>>>>>> 	if (rcvd < 0)
>>>>>>>>>> 		return rcvd;
>>>>>>>>>> -- 
>>>>>>>>>> 2.23.0.rc0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The intention of this code is to make sure the
>>>>>>>>> peer does not violates the MPA handshake rules.
>>>>>>>>> The initiator MUST NOT send extra data after its
>>>>>>>>> MPA request and before receiving the MPA reply.
>>>>>>>>
>>>>>>>> I think this is true only for MPA v2. With MPA v1, the
>>>>>>>> initiator can begin sending immediately (before receiving
>>>>>>>> the MPA reply), because there is no actual negotiation at
>>>>>>>> the MPA layer.
>>>>>>>>
>>>>>>>> With MPA v2, the negotiation exchange is required because
>>>>>>>> the type of the following message is predicated by the
>>>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32 bits
>>>>>>>> of the private data buffer.
>>>>>>>>
>>>>>>>> So, it seems to me that additional logic is needed here to
>>>>>>>> determine the MPA version, before sniffing additional octets
>>>>>>> >from the incoming stream?
>>>>>>>>
>>>>>>>> Tom.
>>>>>>>>
>>>>>>>
>>>>>>> There still is the marker negotiation taking place.
>>>>>>> RFC 5044 says in section 7.1.2:
>>>>>>>
>>>>>>> "Note: Since the receiver's ability to deal with Markers is
>>>>>>>    unknown until the Request and Reply Frames have been
>>>>>>>    received, sending FPDUs before this occurs is not
>possible."
>>>>>>>
>>>>>>> This section further discusses the responder's behavior,
>>>>>>> where it MUST receive a first FPDU from the initiator
>>>>>>> before sending its first FPDU:
>>>>>>>
>>>>>>> "4.  MPA Responder mode implementations MUST receive and
>>> validate
>>>>> at
>>>>>>>          least one FPDU before sending any FPDUs or Markers."
>>>>>>>
>>>>>>> So it appears with MPA version 1, the current siw
>>>>>>> code is correct. The initiator is entering FPDU mode
>>>>>>> first, and only after receiving the MPA reply frame.
>>>>>>> Only after the initiator sent a first FPDU, the responder
>>>>>>> can start using the connection in FPDU mode.
>>>>>>> Because of this somehow broken connection establishment
>>>>>>> procedure (only the initiator can start sending data), a
>>>>>>> later MPA version makes this first FPDU exchange explicit
>>>>>>> and selectable (zero length READ/WRITE/Send).
>>>>>>
>>>>>> Yeah, I guess so. Because nobody ever actually implemented
>>> markers,
>>>>>> I think that they may more or less passively ignore this. But
>>>>> you're
>>>>>> currect that it's invalid protocol behavior.
>>>>>>
>>>>>> If your testing didn't uncover any issues with existing
>>>>> implementations
>>>>>> failing to connect with your strict checking, I'm ok with it.
>>>>> Tom & Bernard,
>>>>> Thanks for the insight on MPA negotiation.
>>>>>
>>>>> Could the below patch be considered as a proper fix?
>>>>>
>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>>> index 9ce8a1b925d2..0aec1b5212f9 100644
>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep
>>> *cep)
>>>>>         struct socket *s = cep->sock;
>>>>>         u16 pd_len;
>>>>>         int rcvd, to_rcv;
>>>>> +       int extra_data_check = 4; /* 4Bytes, for MPA rules
>>> violation
>>>>> checking */
>>>>>
>>>>>         if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
>>>>>                 rcvd = ksock_recv(s, (char *)hdr +
>>>>> cep->mpa.bytes_rcvd,
>>>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep
>>> *cep)
>>>>>                 return -EPROTO;
>>>>>         }
>>>>>
>>>>> +       /*
>>>>> +        * Peer must not send any extra data other than MPA
>>> messages
>>>>> until MPA
>>>>> +        * negotiation is completed, an exception is MPA V2
>>>>> client-server Mode,
>>>>> +        * IE, in this mode the peer after sending MPA Reply can
>>>>> immediately
>>>>> +        * start sending data in RDMA mode.
>>>>> +        */
>>>>
>>>> This is unfortunately not true. The responder NEVER sends
>>>> an FPDU without having seen an FPDU from the initiator.
>>>> I just checked RFC 6581 again. The RTR (ready-to-receive)
>>>> indication from the initiator is still needed, but now
>>>> provided by the protocol and not the application: w/o
>>>> 'enhanced MPA setup', the initiator sends the first
>>>> FPDU as an application message. With 'enhanced MPA setup',
>>>> the initiator protocol entity sends (w/o application
>>>> interaction) a zero length READ/WRITE/Send as a first FPDU,
>>>> as previously negotiated with the responder. Again: only
>>>> after the responder has seen the first FPDU, it can
>>>> start sending in FPDU mode.
>>> If I understand your statements correctly: in MPA v2 clint-server
>>> mode,
>>> the responder application should have some logic to wait(after
>>> ESTABLISH
>>> event) until the first FPDU message is received from the
>initiator?
>>>
>>> Thanks,
>>> Krishna.
>> 
>> The responder may either delay the established event
>> until it sees the first peer FPDU, or it delays sending
>> the first (now already posted) FPDU, until it sees the
>> first initiator FPDU. This, since (theoretically), the
>> responder does not know yet the result of CRC and
>> Marker negotiation. siw delays the established event,
>> which seems the more clear path to me.
>> 
>> This MPA handshake is cumbersome. siw is permissive
>> for violations of these rules to the extend, that the
>> responder shall not send back to back the first FPDU
>> with the MPA reply. If a (Chelsio ?) iWarp adapter
>> would send the first responder FPDU only after it
>> got TCP acked its MPA reply, the chances are high the
>> (siw) iWarp initiator has already transitioned to RTS
>> and it would accept that FPU, even if not having sent
>> it's first FPDU.
>> 
>> siw sticks to this rule to that extend, since all receive
>> processing is triggered by socket callbacks (from
>> sk->sk_data_ready()). As long as the QP is not
>> in RTS, all TCP bytestream processing is done by the
>> MPA protocol (code in siw_cm.c). If the QP reaches
>> RTS, RX processing is done bt the full iWarp
>> protocol (code in siw_qp_rx.c for RX).
>> Now, if the higher layer (application) protocol has
>> a semantic where the responder sends the first message,
>> we could end up in an infinite wait for that packet
>> at initiator application side.
>> This, since the complete first FPDU was already
>> received at the TCP socket, while the QP was not in
>> RTS. It will not trigger any additional sk->sk_data_ready()
>> and we are stuck with a FPDU in the socket rx buffer.
>> 
>> I implemented that 4 bytes extra data testing only to
>> keep the siw protocol state machine as simple as
>> possible (it is already unfortunately complex, if
>> you count the lines in siw_cm.c), while following
>> the protocol rules.
>> 
>> I suggest to correctly implement peer2peer mode and
>> delay the established event at the responder side until
>> it got the negotiated first zero length FPDU. Out
>> of the possible options, siw supports both zero length
>> WRITE and READ, but no SEND, since this would consume
>> an extra receive WQE.
>> 
>> As a last resort, I might consider extending the
>> siw state machine to handle those - non-complaint -
>> cases gracefully. But, that would result in different
>> code than just avoiding checking for peer protocol
>> violation.
>
>Bernard, everything you describe in siw seems perfectly valid
>to me, and in fact that's why the MPA enhanced connection mode
>was designed the way it is. I disagree that it's "cumbersome",
>but that's a nit.
>
>The issue is that the responder reaches RTS at a different time
>than the initiator reaches RTR. The original iWARP connection
>model did not make any requirement, and races were possible.
>MPAv2 fixes those.
>
Tom, thanks, exactly. MPAv2 fixes it. And an implementation
of that fix MUST adhere to the defined rules. If one
negotiates an extra handshake to synchronize, it
MUST adhere to that handshake rules. There is no point in
negotiating an extra handshake, and right away ignoring it.
If the responder wants a zero length WRITE, it MUST use
its reception to synchronize its transition to RTS.

Shall we extend the siw state machine to support silly
peer behavior? As said, if I read there is no way of
making the peer implementation RFC compliant (it would
be unexpected), I'll look into the applicability of
Postel's Rule. But let's make those things explicit.

>SIW is certainly within its rights to prevent peers from
>sending prior to responder RTR. I'd suggest that if possible,
>you try to detect the deadlock rather than flatly rejecting any
>incoming bytes. The Internet Principle (be liberal in what you
>accept...), after all.
>
>This same dance happens in IB/RoCE, btw. Just via different
>messages.
>
Interesting!

Thanks
Bernard.

>Tom.
>
>
>> 
>> Thanks,
>> Bernard.
>> 
>> 
>>>>
>>>> Sorry for the confusion. But the current
>>>> siw code appears to be just correct.
>>>>
>>>> Thanks
>>>> Bernard
>>>>
>>>>
>>>>
>>>>> +       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
>>>>> MPA_REVISION_2) &&
>>>>> +               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
>>>>> +               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
>>>>> +                               (MPA_V2_RDMA_WRITE_RTR |
>>>>> MPA_V2_RDMA_READ_RTR);
>>>>> +               if (!mpa_p2p_mode)
>>>>> +                       extra_data_check = 0;
>>>>> +       }
>>>>> +
>>>>>         /*
>>>>>          * At this point, we must have hdr->params.pd_len != 0.
>>>>>          * A private data buffer gets allocated if
>>> hdr->params.pd_len
>>>>> !=
>>>>>          * 0.
>>>>>          */
>>>>>         if (!cep->mpa.pdata) {
>>>>> -               cep->mpa.pdata = kmalloc(pd_len + 4,
>GFP_KERNEL);
>>>>> +               cep->mpa.pdata = kmalloc(pd_len +
>>> extra_data_check,
>>>>> GFP_KERNEL);
>>>>>                 if (!cep->mpa.pdata)
>>>>>                         return -ENOMEM;
>>>>>         }
>>>>>         rcvd = ksock_recv(
>>>>>                 s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>>>>> sizeof(struct
>>>>> mpa_rr),
>>>>> -               to_rcv + 4, MSG_DONTWAIT);
>>>>> +               to_rcv + extra_data_check, MSG_DONTWAIT);
>>>>>
>>>>>         if (rcvd < 0)
>>>>>                 return rcvd;
>>>>>
>>>>> -       if (rcvd > to_rcv)
>>>>> +       if (extra_data_check && (rcvd > to_rcv))
>>>>>                 return -EPROTO;
>>>>>
>>>>>         cep->mpa.bytes_rcvd += rcvd;
>>>>>
>>>>> -Krishna.
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Bernard.
>>>>>>>
>>>>>>>>
>>>>>>>>> So, for the MPA request case, this code is needed
>>>>>>>>> to check for protocol correctness.
>>>>>>>>> You are right for the MPA reply case - if we are
>>>>>>>>> _not_ in peer2peer mode, the peer can immediately
>>>>>>>>> start sending data in RDMA mode after its MPA Reply.
>>>>>>>>> So we shall add appropriate code to be more specific
>>>>>>>>> For an error, we are (1) processing an MPA Request,
>>>>>>>>> OR (2) processing an MPA Reply AND we are not expected
>>>>>>>>> to send an initial READ/WRITE/Send as negotiated with
>>>>>>>>> the peer (peer2peer mode MPA handshake).
>>>>>>>>>
>>>>>>>>> Just removing this check would make siw more permissive,
>>>>>>>>> but to a point where peer MPA protocol errors are
>>>>>>>>> tolerated. I am not in favor of that level of
>>>>>>>>> forgiveness.
>>>>>>>>>
>>>>>>>>> If possible, please provide an appropriate patch
>>>>>>>>> or (if it causes current issues with another peer
>>>>>>>>> iWarp implementation) just run in MPA peer2peer mode,
>>>>>>>>> where the current check is appropriate.
>>>>>>>>> Otherwise, I would provide an appropriate fix by Monday
>>>>>>>>> (I am still out of office this week).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Many thanks and best regards,
>>>>>>>>> Bernard.
>>>>>>>>>
>
>
Tom Talpey Aug. 9, 2019, 8:35 p.m. UTC | #14
On 8/9/2019 9:52 AM, Bernard Metzler wrote:
> -----"Tom Talpey" <tom@talpey.com> wrote: -----
> 
>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju Eraparaju"
>> <krishna2@chelsio.com>
>> From: "Tom Talpey" <tom@talpey.com>
>> Date: 08/09/2019 02:27PM
>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>> <bharat@chelsio.com>, "Nirranjan Kirubaharan" <nirranjan@chelsio.com>
>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>> to read beyond MPA message
>>
>> On 8/9/2019 6:29 AM, Bernard Metzler wrote:
>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>>>
>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>
>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>> Date: 08/08/2019 06:47PM
>>>> Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca" <jgg@ziepe.ca>,
>>>> "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
>> "Potnuri
>>>> Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>>>> <nirranjan@chelsio.com>
>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>> tries
>>>> to read beyond MPA message
>>>>
>>>> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard
>> Metzler
>>>> wrote:
>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
>>>>>
>>>>>> To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>>>> Date: 08/05/2019 07:26PM
>>>>>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
>>>>>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>>>>>> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>>>> <nirranjan@chelsio.com>
>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>>>> tries
>>>>>> to read beyond MPA message
>>>>>>
>>>>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey
>>>> wrote:
>>>>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
>>>>>>>> -----"Tom Talpey" <tom@talpey.com> wrote: -----
>>>>>>>>
>>>>>>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>>>>>> Eraparaju"
>>>>>>>>> <krishna2@chelsio.com>
>>>>>>>>> From: "Tom Talpey" <tom@talpey.com>
>>>>>>>>> Date: 08/01/2019 08:54PM
>>>>>>>>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
>>>>>> bharat@chelsio.com,
>>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com
>>>>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>>>>>> tries
>>>>>>>>> to read beyond MPA message
>>>>>>>>>
>>>>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>>>>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>>>>>> -----
>>>>>>>>>>
>>>>>>>>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>>>>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>>>>>>>>>>> Date: 07/31/2019 12:34PM
>>>>>>>>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>>>>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
>>>>>>>>> Eraparaju"
>>>>>>>>>>> <krishna2@chelsio.com>
>>>>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
>>>>>> tries to
>>>>>>>>>>> read beyond MPA message
>>>>>>>>>>>
>>>>>>>>>>> while processing MPA Reply, SIW driver is trying to read
>>>> extra
>>>>>> 4
>>>>>>>>>>> bytes
>>>>>>>>>>> than what peer has advertised as private data length.
>>>>>>>>>>>
>>>>>>>>>>> If a FPDU data is received before even siw_recv_mpa_rr()
>>>>>> completed
>>>>>>>>>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
>>>>>> could
>>>>>>>>> also
>>>>>>>>>>> read FPDU, if "size" is larger than advertised MPA reply
>>>>>> length.
>>>>>>>>>>>
>>>>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>>>>>>>>>>> 502 {
>>>>>>>>>>>              .............
>>>>>>>>>>> 572
>>>>>>>>>>> 573         if (rcvd > to_rcv)
>>>>>>>>>>> 574                 return -EPROTO;   <----- Failure here
>>>>>>>>>>>
>>>>>>>>>>> Looks like the intention here is to throw an ERROR if the
>>>>>> received
>>>>>>>>>>> data
>>>>>>>>>>> is more than the total private data length advertised by
>>>> the
>>>>>> peer.
>>>>>>>>>>> But
>>>>>>>>>>> reading beyond MPA message causes siw_cm to generate
>>>>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv
>>>> buffer
>>>>>> is
>>>>>>>>>>> already
>>>>>>>>>>> queued with FPDU messages.
>>>>>>>>>>>
>>>>>>>>>>> Hence, this function should only read upto private data
>>>>>> length.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Krishnamraju Eraparaju
>>>> <krishna2@chelsio.com>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644
>>>>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
>>>>>> siw_cep
>>>>>>>>> *cep)
>>>>>>>>>>> 	 * A private data buffer gets allocated if
>>>> hdr->params.pd_len
>>>>>> !=
>>>>>>>>> 0.
>>>>>>>>>>> 	 */
>>>>>>>>>>> 	if (!cep->mpa.pdata) {
>>>>>>>>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>>>>>>>>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>>>>>>>>>>> 		if (!cep->mpa.pdata)
>>>>>>>>>>> 			return -ENOMEM;
>>>>>>>>>>> 	}
>>>>>>>>>>> 	rcvd = ksock_recv(
>>>>>>>>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
>>>>>> mpa_rr),
>>>>>>>>>>> -		to_rcv + 4, MSG_DONTWAIT);
>>>>>>>>>>> +		to_rcv, MSG_DONTWAIT);
>>>>>>>>>>>
>>>>>>>>>>> 	if (rcvd < 0)
>>>>>>>>>>> 		return rcvd;
>>>>>>>>>>> -- 
>>>>>>>>>>> 2.23.0.rc0
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The intention of this code is to make sure the
>>>>>>>>>> peer does not violates the MPA handshake rules.
>>>>>>>>>> The initiator MUST NOT send extra data after its
>>>>>>>>>> MPA request and before receiving the MPA reply.
>>>>>>>>>
>>>>>>>>> I think this is true only for MPA v2. With MPA v1, the
>>>>>>>>> initiator can begin sending immediately (before receiving
>>>>>>>>> the MPA reply), because there is no actual negotiation at
>>>>>>>>> the MPA layer.
>>>>>>>>>
>>>>>>>>> With MPA v2, the negotiation exchange is required because
>>>>>>>>> the type of the following message is predicated by the
>>>>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32 bits
>>>>>>>>> of the private data buffer.
>>>>>>>>>
>>>>>>>>> So, it seems to me that additional logic is needed here to
>>>>>>>>> determine the MPA version, before sniffing additional octets
>>>>>>>> >from the incoming stream?
>>>>>>>>>
>>>>>>>>> Tom.
>>>>>>>>>
>>>>>>>>
>>>>>>>> There still is the marker negotiation taking place.
>>>>>>>> RFC 5044 says in section 7.1.2:
>>>>>>>>
>>>>>>>> "Note: Since the receiver's ability to deal with Markers is
>>>>>>>>     unknown until the Request and Reply Frames have been
>>>>>>>>     received, sending FPDUs before this occurs is not
>> possible."
>>>>>>>>
>>>>>>>> This section further discusses the responder's behavior,
>>>>>>>> where it MUST receive a first FPDU from the initiator
>>>>>>>> before sending its first FPDU:
>>>>>>>>
>>>>>>>> "4.  MPA Responder mode implementations MUST receive and
>>>> validate
>>>>>> at
>>>>>>>>           least one FPDU before sending any FPDUs or Markers."
>>>>>>>>
>>>>>>>> So it appears with MPA version 1, the current siw
>>>>>>>> code is correct. The initiator is entering FPDU mode
>>>>>>>> first, and only after receiving the MPA reply frame.
>>>>>>>> Only after the initiator sent a first FPDU, the responder
>>>>>>>> can start using the connection in FPDU mode.
>>>>>>>> Because of this somehow broken connection establishment
>>>>>>>> procedure (only the initiator can start sending data), a
>>>>>>>> later MPA version makes this first FPDU exchange explicit
>>>>>>>> and selectable (zero length READ/WRITE/Send).
>>>>>>>
>>>>>>> Yeah, I guess so. Because nobody ever actually implemented
>>>> markers,
>>>>>>> I think that they may more or less passively ignore this. But
>>>>>> you're
>>>>>>> currect that it's invalid protocol behavior.
>>>>>>>
>>>>>>> If your testing didn't uncover any issues with existing
>>>>>> implementations
>>>>>>> failing to connect with your strict checking, I'm ok with it.
>>>>>> Tom & Bernard,
>>>>>> Thanks for the insight on MPA negotiation.
>>>>>>
>>>>>> Could the below patch be considered as a proper fix?
>>>>>>
>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>> index 9ce8a1b925d2..0aec1b5212f9 100644
>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep
>>>> *cep)
>>>>>>          struct socket *s = cep->sock;
>>>>>>          u16 pd_len;
>>>>>>          int rcvd, to_rcv;
>>>>>> +       int extra_data_check = 4; /* 4Bytes, for MPA rules
>>>> violation
>>>>>> checking */
>>>>>>
>>>>>>          if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
>>>>>>                  rcvd = ksock_recv(s, (char *)hdr +
>>>>>> cep->mpa.bytes_rcvd,
>>>>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep
>>>> *cep)
>>>>>>                  return -EPROTO;
>>>>>>          }
>>>>>>
>>>>>> +       /*
>>>>>> +        * Peer must not send any extra data other than MPA
>>>> messages
>>>>>> until MPA
>>>>>> +        * negotiation is completed, an exception is MPA V2
>>>>>> client-server Mode,
>>>>>> +        * IE, in this mode the peer after sending MPA Reply can
>>>>>> immediately
>>>>>> +        * start sending data in RDMA mode.
>>>>>> +        */
>>>>>
>>>>> This is unfortunately not true. The responder NEVER sends
>>>>> an FPDU without having seen an FPDU from the initiator.
>>>>> I just checked RFC 6581 again. The RTR (ready-to-receive)
>>>>> indication from the initiator is still needed, but now
>>>>> provided by the protocol and not the application: w/o
>>>>> 'enhanced MPA setup', the initiator sends the first
>>>>> FPDU as an application message. With 'enhanced MPA setup',
>>>>> the initiator protocol entity sends (w/o application
>>>>> interaction) a zero length READ/WRITE/Send as a first FPDU,
>>>>> as previously negotiated with the responder. Again: only
>>>>> after the responder has seen the first FPDU, it can
>>>>> start sending in FPDU mode.
>>>> If I understand your statements correctly: in MPA v2 clint-server
>>>> mode,
>>>> the responder application should have some logic to wait(after
>>>> ESTABLISH
>>>> event) until the first FPDU message is received from the
>> initiator?
>>>>
>>>> Thanks,
>>>> Krishna.
>>>
>>> The responder may either delay the established event
>>> until it sees the first peer FPDU, or it delays sending
>>> the first (now already posted) FPDU, until it sees the
>>> first initiator FPDU. This, since (theoretically), the
>>> responder does not know yet the result of CRC and
>>> Marker negotiation. siw delays the established event,
>>> which seems the more clear path to me.
>>>
>>> This MPA handshake is cumbersome. siw is permissive
>>> for violations of these rules to the extend, that the
>>> responder shall not send back to back the first FPDU
>>> with the MPA reply. If a (Chelsio ?) iWarp adapter
>>> would send the first responder FPDU only after it
>>> got TCP acked its MPA reply, the chances are high the
>>> (siw) iWarp initiator has already transitioned to RTS
>>> and it would accept that FPU, even if not having sent
>>> it's first FPDU.
>>>
>>> siw sticks to this rule to that extend, since all receive
>>> processing is triggered by socket callbacks (from
>>> sk->sk_data_ready()). As long as the QP is not
>>> in RTS, all TCP bytestream processing is done by the
>>> MPA protocol (code in siw_cm.c). If the QP reaches
>>> RTS, RX processing is done bt the full iWarp
>>> protocol (code in siw_qp_rx.c for RX).
>>> Now, if the higher layer (application) protocol has
>>> a semantic where the responder sends the first message,
>>> we could end up in an infinite wait for that packet
>>> at initiator application side.
>>> This, since the complete first FPDU was already
>>> received at the TCP socket, while the QP was not in
>>> RTS. It will not trigger any additional sk->sk_data_ready()
>>> and we are stuck with a FPDU in the socket rx buffer.
>>>
>>> I implemented that 4 bytes extra data testing only to
>>> keep the siw protocol state machine as simple as
>>> possible (it is already unfortunately complex, if
>>> you count the lines in siw_cm.c), while following
>>> the protocol rules.
>>>
>>> I suggest to correctly implement peer2peer mode and
>>> delay the established event at the responder side until
>>> it got the negotiated first zero length FPDU. Out
>>> of the possible options, siw supports both zero length
>>> WRITE and READ, but no SEND, since this would consume
>>> an extra receive WQE.
>>>
>>> As a last resort, I might consider extending the
>>> siw state machine to handle those - non-complaint -
>>> cases gracefully. But, that would result in different
>>> code than just avoiding checking for peer protocol
>>> violation.
>>
>> Bernard, everything you describe in siw seems perfectly valid
>> to me, and in fact that's why the MPA enhanced connection mode
>> was designed the way it is. I disagree that it's "cumbersome",
>> but that's a nit.
>>
>> The issue is that the responder reaches RTS at a different time
>> than the initiator reaches RTR. The original iWARP connection
>> model did not make any requirement, and races were possible.
>> MPAv2 fixes those.
>>
> Tom, thanks, exactly. MPAv2 fixes it. And an implementation
> of that fix MUST adhere to the defined rules. If one
> negotiates an extra handshake to synchronize, it
> MUST adhere to that handshake rules. There is no point in
> negotiating an extra handshake, and right away ignoring it.
> If the responder wants a zero length WRITE, it MUST use
> its reception to synchronize its transition to RTS.
> 
> Shall we extend the siw state machine to support silly
> peer behavior? As said, if I read there is no way of

Of course the right answer to this is "no", but I'd reserve
drawing that conclusion until I am certain it won't cause
trouble. If existing iWARP implementations don't enforce
this, and SIW becomes the odd duck failing connections, then
I'm afraid I would have to say "yes" instead.

> making the peer implementation RFC compliant (it would
> be unexpected), I'll look into the applicability of
> Postel's Rule. But let's make those things explicit.

Explicit is best, agreed.

Tom.

> 
>> SIW is certainly within its rights to prevent peers from
>> sending prior to responder RTR. I'd suggest that if possible,
>> you try to detect the deadlock rather than flatly rejecting any
>> incoming bytes. The Internet Principle (be liberal in what you
>> accept...), after all.
>>
>> This same dance happens in IB/RoCE, btw. Just via different
>> messages.
>>
> Interesting!
> 
> Thanks
> Bernard.
> 
>> Tom.
>>
>>
>>>
>>> Thanks,
>>> Bernard.
>>>
>>>
>>>>>
>>>>> Sorry for the confusion. But the current
>>>>> siw code appears to be just correct.
>>>>>
>>>>> Thanks
>>>>> Bernard
>>>>>
>>>>>
>>>>>
>>>>>> +       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
>>>>>> MPA_REVISION_2) &&
>>>>>> +               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
>>>>>> +               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
>>>>>> +                               (MPA_V2_RDMA_WRITE_RTR |
>>>>>> MPA_V2_RDMA_READ_RTR);
>>>>>> +               if (!mpa_p2p_mode)
>>>>>> +                       extra_data_check = 0;
>>>>>> +       }
>>>>>> +
>>>>>>          /*
>>>>>>           * At this point, we must have hdr->params.pd_len != 0.
>>>>>>           * A private data buffer gets allocated if
>>>> hdr->params.pd_len
>>>>>> !=
>>>>>>           * 0.
>>>>>>           */
>>>>>>          if (!cep->mpa.pdata) {
>>>>>> -               cep->mpa.pdata = kmalloc(pd_len + 4,
>> GFP_KERNEL);
>>>>>> +               cep->mpa.pdata = kmalloc(pd_len +
>>>> extra_data_check,
>>>>>> GFP_KERNEL);
>>>>>>                  if (!cep->mpa.pdata)
>>>>>>                          return -ENOMEM;
>>>>>>          }
>>>>>>          rcvd = ksock_recv(
>>>>>>                  s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>>>>>> sizeof(struct
>>>>>> mpa_rr),
>>>>>> -               to_rcv + 4, MSG_DONTWAIT);
>>>>>> +               to_rcv + extra_data_check, MSG_DONTWAIT);
>>>>>>
>>>>>>          if (rcvd < 0)
>>>>>>                  return rcvd;
>>>>>>
>>>>>> -       if (rcvd > to_rcv)
>>>>>> +       if (extra_data_check && (rcvd > to_rcv))
>>>>>>                  return -EPROTO;
>>>>>>
>>>>>>          cep->mpa.bytes_rcvd += rcvd;
>>>>>>
>>>>>> -Krishna.
>>>>>>>
>>>>>>> Tom.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Bernard.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> So, for the MPA request case, this code is needed
>>>>>>>>>> to check for protocol correctness.
>>>>>>>>>> You are right for the MPA reply case - if we are
>>>>>>>>>> _not_ in peer2peer mode, the peer can immediately
>>>>>>>>>> start sending data in RDMA mode after its MPA Reply.
>>>>>>>>>> So we shall add appropriate code to be more specific
>>>>>>>>>> For an error, we are (1) processing an MPA Request,
>>>>>>>>>> OR (2) processing an MPA Reply AND we are not expected
>>>>>>>>>> to send an initial READ/WRITE/Send as negotiated with
>>>>>>>>>> the peer (peer2peer mode MPA handshake).
>>>>>>>>>>
>>>>>>>>>> Just removing this check would make siw more permissive,
>>>>>>>>>> but to a point where peer MPA protocol errors are
>>>>>>>>>> tolerated. I am not in favor of that level of
>>>>>>>>>> forgiveness.
>>>>>>>>>>
>>>>>>>>>> If possible, please provide an appropriate patch
>>>>>>>>>> or (if it causes current issues with another peer
>>>>>>>>>> iWarp implementation) just run in MPA peer2peer mode,
>>>>>>>>>> where the current check is appropriate.
>>>>>>>>>> Otherwise, I would provide an appropriate fix by Monday
>>>>>>>>>> (I am still out of office this week).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Many thanks and best regards,
>>>>>>>>>> Bernard.
>>>>>>>>>>
>>
>>
> 
> 
>
Krishnamraju Eraparaju Aug. 12, 2019, 9:58 a.m. UTC | #15
On Saturday, August 08/10/19, 2019 at 02:05:00 +0530, Tom Talpey wrote:
> On 8/9/2019 9:52 AM, Bernard Metzler wrote:
> > -----"Tom Talpey" <tom@talpey.com> wrote: -----
> > 
> >> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju Eraparaju"
> >> <krishna2@chelsio.com>
> >> From: "Tom Talpey" <tom@talpey.com>
> >> Date: 08/09/2019 02:27PM
> >> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
> >> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
> >> <bharat@chelsio.com>, "Nirranjan Kirubaharan" <nirranjan@chelsio.com>
> >> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
> >> to read beyond MPA message
> >>
> >> On 8/9/2019 6:29 AM, Bernard Metzler wrote:
> >>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
> >>>
> >>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>
> >>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >>>> Date: 08/08/2019 06:47PM
> >>>> Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca" <jgg@ziepe.ca>,
> >>>> "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
> >> "Potnuri
> >>>> Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
> >>>> <nirranjan@chelsio.com>
> >>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
> >> tries
> >>>> to read beyond MPA message
> >>>>
> >>>> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard
> >> Metzler
> >>>> wrote:
> >>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
> >>>>>
> >>>>>> To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
> >>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >>>>>> Date: 08/05/2019 07:26PM
> >>>>>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
> >>>>>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
> >>>>>> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
> >>>> <nirranjan@chelsio.com>
> >>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
> >>>> tries
> >>>>>> to read beyond MPA message
> >>>>>>
> >>>>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey
> >>>> wrote:
> >>>>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
> >>>>>>>> -----"Tom Talpey" <tom@talpey.com> wrote: -----
> >>>>>>>>
> >>>>>>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
> >>>>>> Eraparaju"
> >>>>>>>>> <krishna2@chelsio.com>
> >>>>>>>>> From: "Tom Talpey" <tom@talpey.com>
> >>>>>>>>> Date: 08/01/2019 08:54PM
> >>>>>>>>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
> >>>>>> bharat@chelsio.com,
> >>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com
> >>>>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
> >>>>>> tries
> >>>>>>>>> to read beyond MPA message
> >>>>>>>>>
> >>>>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
> >>>>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
> >>>>>> -----
> >>>>>>>>>>
> >>>>>>>>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
> >>>>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >>>>>>>>>>> Date: 07/31/2019 12:34PM
> >>>>>>>>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
> >>>>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com, "Krishnamraju
> >>>>>>>>> Eraparaju"
> >>>>>>>>>>> <krishna2@chelsio.com>
> >>>>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler
> >>>>>> tries to
> >>>>>>>>>>> read beyond MPA message
> >>>>>>>>>>>
> >>>>>>>>>>> while processing MPA Reply, SIW driver is trying to read
> >>>> extra
> >>>>>> 4
> >>>>>>>>>>> bytes
> >>>>>>>>>>> than what peer has advertised as private data length.
> >>>>>>>>>>>
> >>>>>>>>>>> If a FPDU data is received before even siw_recv_mpa_rr()
> >>>>>> completed
> >>>>>>>>>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr()
> >>>>>> could
> >>>>>>>>> also
> >>>>>>>>>>> read FPDU, if "size" is larger than advertised MPA reply
> >>>>>> length.
> >>>>>>>>>>>
> >>>>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
> >>>>>>>>>>> 502 {
> >>>>>>>>>>>              .............
> >>>>>>>>>>> 572
> >>>>>>>>>>> 573         if (rcvd > to_rcv)
> >>>>>>>>>>> 574                 return -EPROTO;   <----- Failure here
> >>>>>>>>>>>
> >>>>>>>>>>> Looks like the intention here is to throw an ERROR if the
> >>>>>> received
> >>>>>>>>>>> data
> >>>>>>>>>>> is more than the total private data length advertised by
> >>>> the
> >>>>>> peer.
> >>>>>>>>>>> But
> >>>>>>>>>>> reading beyond MPA message causes siw_cm to generate
> >>>>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv
> >>>> buffer
> >>>>>> is
> >>>>>>>>>>> already
> >>>>>>>>>>> queued with FPDU messages.
> >>>>>>>>>>>
> >>>>>>>>>>> Hence, this function should only read upto private data
> >>>>>> length.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Krishnamraju Eraparaju
> >>>> <krishna2@chelsio.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
> >>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644
> >>>>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
> >>>>>> siw_cep
> >>>>>>>>> *cep)
> >>>>>>>>>>> 	 * A private data buffer gets allocated if
> >>>> hdr->params.pd_len
> >>>>>> !=
> >>>>>>>>> 0.
> >>>>>>>>>>> 	 */
> >>>>>>>>>>> 	if (!cep->mpa.pdata) {
> >>>>>>>>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
> >>>>>>>>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
> >>>>>>>>>>> 		if (!cep->mpa.pdata)
> >>>>>>>>>>> 			return -ENOMEM;
> >>>>>>>>>>> 	}
> >>>>>>>>>>> 	rcvd = ksock_recv(
> >>>>>>>>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct
> >>>>>> mpa_rr),
> >>>>>>>>>>> -		to_rcv + 4, MSG_DONTWAIT);
> >>>>>>>>>>> +		to_rcv, MSG_DONTWAIT);
> >>>>>>>>>>>
> >>>>>>>>>>> 	if (rcvd < 0)
> >>>>>>>>>>> 		return rcvd;
> >>>>>>>>>>> -- 
> >>>>>>>>>>> 2.23.0.rc0
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The intention of this code is to make sure the
> >>>>>>>>>> peer does not violates the MPA handshake rules.
> >>>>>>>>>> The initiator MUST NOT send extra data after its
> >>>>>>>>>> MPA request and before receiving the MPA reply.
> >>>>>>>>>
> >>>>>>>>> I think this is true only for MPA v2. With MPA v1, the
> >>>>>>>>> initiator can begin sending immediately (before receiving
> >>>>>>>>> the MPA reply), because there is no actual negotiation at
> >>>>>>>>> the MPA layer.
> >>>>>>>>>
> >>>>>>>>> With MPA v2, the negotiation exchange is required because
> >>>>>>>>> the type of the following message is predicated by the
> >>>>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32 bits
> >>>>>>>>> of the private data buffer.
> >>>>>>>>>
> >>>>>>>>> So, it seems to me that additional logic is needed here to
> >>>>>>>>> determine the MPA version, before sniffing additional octets
> >>>>>>>> >from the incoming stream?
> >>>>>>>>>
> >>>>>>>>> Tom.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> There still is the marker negotiation taking place.
> >>>>>>>> RFC 5044 says in section 7.1.2:
> >>>>>>>>
> >>>>>>>> "Note: Since the receiver's ability to deal with Markers is
> >>>>>>>>     unknown until the Request and Reply Frames have been
> >>>>>>>>     received, sending FPDUs before this occurs is not
> >> possible."
> >>>>>>>>
> >>>>>>>> This section further discusses the responder's behavior,
> >>>>>>>> where it MUST receive a first FPDU from the initiator
> >>>>>>>> before sending its first FPDU:
> >>>>>>>>
> >>>>>>>> "4.  MPA Responder mode implementations MUST receive and
> >>>> validate
> >>>>>> at
> >>>>>>>>           least one FPDU before sending any FPDUs or Markers."
> >>>>>>>>
> >>>>>>>> So it appears with MPA version 1, the current siw
> >>>>>>>> code is correct. The initiator is entering FPDU mode
> >>>>>>>> first, and only after receiving the MPA reply frame.
> >>>>>>>> Only after the initiator sent a first FPDU, the responder
> >>>>>>>> can start using the connection in FPDU mode.
> >>>>>>>> Because of this somehow broken connection establishment
> >>>>>>>> procedure (only the initiator can start sending data), a
> >>>>>>>> later MPA version makes this first FPDU exchange explicit
> >>>>>>>> and selectable (zero length READ/WRITE/Send).
> >>>>>>>
> >>>>>>> Yeah, I guess so. Because nobody ever actually implemented
> >>>> markers,
> >>>>>>> I think that they may more or less passively ignore this. But
> >>>>>> you're
> >>>>>>> currect that it's invalid protocol behavior.
> >>>>>>>
> >>>>>>> If your testing didn't uncover any issues with existing
> >>>>>> implementations
> >>>>>>> failing to connect with your strict checking, I'm ok with it.
> >>>>>> Tom & Bernard,
> >>>>>> Thanks for the insight on MPA negotiation.
> >>>>>>
> >>>>>> Could the below patch be considered as a proper fix?
> >>>>>>
> >>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>> index 9ce8a1b925d2..0aec1b5212f9 100644
> >>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> >>>>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep
> >>>> *cep)
> >>>>>>          struct socket *s = cep->sock;
> >>>>>>          u16 pd_len;
> >>>>>>          int rcvd, to_rcv;
> >>>>>> +       int extra_data_check = 4; /* 4Bytes, for MPA rules
> >>>> violation
> >>>>>> checking */
> >>>>>>
> >>>>>>          if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
> >>>>>>                  rcvd = ksock_recv(s, (char *)hdr +
> >>>>>> cep->mpa.bytes_rcvd,
> >>>>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep
> >>>> *cep)
> >>>>>>                  return -EPROTO;
> >>>>>>          }
> >>>>>>
> >>>>>> +       /*
> >>>>>> +        * Peer must not send any extra data other than MPA
> >>>> messages
> >>>>>> until MPA
> >>>>>> +        * negotiation is completed, an exception is MPA V2
> >>>>>> client-server Mode,
> >>>>>> +        * IE, in this mode the peer after sending MPA Reply can
> >>>>>> immediately
> >>>>>> +        * start sending data in RDMA mode.
> >>>>>> +        */
> >>>>>
> >>>>> This is unfortunately not true. The responder NEVER sends
> >>>>> an FPDU without having seen an FPDU from the initiator.
> >>>>> I just checked RFC 6581 again. The RTR (ready-to-receive)
> >>>>> indication from the initiator is still needed, but now
> >>>>> provided by the protocol and not the application: w/o
> >>>>> 'enhanced MPA setup', the initiator sends the first
> >>>>> FPDU as an application message. With 'enhanced MPA setup',
> >>>>> the initiator protocol entity sends (w/o application
> >>>>> interaction) a zero length READ/WRITE/Send as a first FPDU,
> >>>>> as previously negotiated with the responder. Again: only
> >>>>> after the responder has seen the first FPDU, it can
> >>>>> start sending in FPDU mode.
> >>>> If I understand your statements correctly: in MPA v2 clint-server
> >>>> mode,
> >>>> the responder application should have some logic to wait(after
> >>>> ESTABLISH
> >>>> event) until the first FPDU message is received from the
> >> initiator?
> >>>>
> >>>> Thanks,
> >>>> Krishna.
> >>>
> >>> The responder may either delay the established event
> >>> until it sees the first peer FPDU, or it delays sending
> >>> the first (now already posted) FPDU, until it sees the
> >>> first initiator FPDU. This, since (theoretically), the
> >>> responder does not know yet the result of CRC and
> >>> Marker negotiation. siw delays the established event,
> >>> which seems the more clear path to me.
> >>>
> >>> This MPA handshake is cumbersome. siw is permissive
> >>> for violations of these rules to the extend, that the
> >>> responder shall not send back to back the first FPDU
> >>> with the MPA reply. If a (Chelsio ?) iWarp adapter
> >>> would send the first responder FPDU only after it
> >>> got TCP acked its MPA reply, the chances are high the
> >>> (siw) iWarp initiator has already transitioned to RTS
> >>> and it would accept that FPU, even if not having sent
> >>> it's first FPDU.
> >>>
> >>> siw sticks to this rule to that extend, since all receive
> >>> processing is triggered by socket callbacks (from
> >>> sk->sk_data_ready()). As long as the QP is not
> >>> in RTS, all TCP bytestream processing is done by the
> >>> MPA protocol (code in siw_cm.c). If the QP reaches
> >>> RTS, RX processing is done bt the full iWarp
> >>> protocol (code in siw_qp_rx.c for RX).
> >>> Now, if the higher layer (application) protocol has
> >>> a semantic where the responder sends the first message,
> >>> we could end up in an infinite wait for that packet
> >>> at initiator application side.
> >>> This, since the complete first FPDU was already
> >>> received at the TCP socket, while the QP was not in
> >>> RTS. It will not trigger any additional sk->sk_data_ready()
> >>> and we are stuck with a FPDU in the socket rx buffer.
> >>>
> >>> I implemented that 4 bytes extra data testing only to
> >>> keep the siw protocol state machine as simple as
> >>> possible (it is already unfortunately complex, if
> >>> you count the lines in siw_cm.c), while following
> >>> the protocol rules.
> >>>
> >>> I suggest to correctly implement peer2peer mode and
> >>> delay the established event at the responder side until
> >>> it got the negotiated first zero length FPDU. Out
> >>> of the possible options, siw supports both zero length
> >>> WRITE and READ, but no SEND, since this would consume
> >>> an extra receive WQE.
> >>>
> >>> As a last resort, I might consider extending the
> >>> siw state machine to handle those - non-complaint -
> >>> cases gracefully. But, that would result in different
> >>> code than just avoiding checking for peer protocol
> >>> violation.
> >>
> >> Bernard, everything you describe in siw seems perfectly valid
> >> to me, and in fact that's why the MPA enhanced connection mode
> >> was designed the way it is. I disagree that it's "cumbersome",
> >> but that's a nit.
> >>
> >> The issue is that the responder reaches RTS at a different time
> >> than the initiator reaches RTR. The original iWARP connection
> >> model did not make any requirement, and races were possible.
> >> MPAv2 fixes those.
> >>
> > Tom, thanks, exactly. MPAv2 fixes it. And an implementation

Thanks Bernard & Tom,

Could you also please consider making MPA ehanced connection
mode(with RTR handshake) as default for SIW, as MPA V2 peer-
to-peer mode seems to be more promising that MPA V2
client-server Mode, wrt race conditions.

Currently SIW has MPA V2 client-server Mode as default.


In siw_main.c:
         const bool peer_to_peer = 1;


Between, as per my observations, the chances of hitting this 'connect
error' issue is higher with:   SIW(initator)<--->(responder)SIW
than:    SIW(initator) <---> (responder)hard-iwarp


> > of that fix MUST adhere to the defined rules. If one
> > negotiates an extra handshake to synchronize, it
> > MUST adhere to that handshake rules. There is no point in
> > negotiating an extra handshake, and right away ignoring it.
> > If the responder wants a zero length WRITE, it MUST use
> > its reception to synchronize its transition to RTS.
> > 
> > Shall we extend the siw state machine to support silly
> > peer behavior? As said, if I read there is no way of
> 
> Of course the right answer to this is "no", but I'd reserve
> drawing that conclusion until I am certain it won't cause
> trouble. If existing iWARP implementations don't enforce
> this, and SIW becomes the odd duck failing connections, then
> I'm afraid I would have to say "yes" instead.
> 
> > making the peer implementation RFC compliant (it would
> > be unexpected), I'll look into the applicability of
> > Postel's Rule. But let's make those things explicit.
> 
> Explicit is best, agreed.
> 
> Tom.
> 
> > 
> >> SIW is certainly within its rights to prevent peers from
> >> sending prior to responder RTR. I'd suggest that if possible,
> >> you try to detect the deadlock rather than flatly rejecting any
> >> incoming bytes. The Internet Principle (be liberal in what you
> >> accept...), after all.
> >>
> >> This same dance happens in IB/RoCE, btw. Just via different
> >> messages.
> >>
> > Interesting!
> > 
> > Thanks
> > Bernard.
> > 
> >> Tom.
> >>
> >>
> >>>
> >>> Thanks,
> >>> Bernard.
> >>>
> >>>
> >>>>>
> >>>>> Sorry for the confusion. But the current
> >>>>> siw code appears to be just correct.
> >>>>>
> >>>>> Thanks
> >>>>> Bernard
> >>>>>
> >>>>>
> >>>>>
> >>>>>> +       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
> >>>>>> MPA_REVISION_2) &&
> >>>>>> +               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
> >>>>>> +               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
> >>>>>> +                               (MPA_V2_RDMA_WRITE_RTR |
> >>>>>> MPA_V2_RDMA_READ_RTR);
> >>>>>> +               if (!mpa_p2p_mode)
> >>>>>> +                       extra_data_check = 0;
> >>>>>> +       }
> >>>>>> +
> >>>>>>          /*
> >>>>>>           * At this point, we must have hdr->params.pd_len != 0.
> >>>>>>           * A private data buffer gets allocated if
> >>>> hdr->params.pd_len
> >>>>>> !=
> >>>>>>           * 0.
> >>>>>>           */
> >>>>>>          if (!cep->mpa.pdata) {
> >>>>>> -               cep->mpa.pdata = kmalloc(pd_len + 4,
> >> GFP_KERNEL);
> >>>>>> +               cep->mpa.pdata = kmalloc(pd_len +
> >>>> extra_data_check,
> >>>>>> GFP_KERNEL);
> >>>>>>                  if (!cep->mpa.pdata)
> >>>>>>                          return -ENOMEM;
> >>>>>>          }
> >>>>>>          rcvd = ksock_recv(
> >>>>>>                  s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
> >>>>>> sizeof(struct
> >>>>>> mpa_rr),
> >>>>>> -               to_rcv + 4, MSG_DONTWAIT);
> >>>>>> +               to_rcv + extra_data_check, MSG_DONTWAIT);
> >>>>>>
> >>>>>>          if (rcvd < 0)
> >>>>>>                  return rcvd;
> >>>>>>
> >>>>>> -       if (rcvd > to_rcv)
> >>>>>> +       if (extra_data_check && (rcvd > to_rcv))
> >>>>>>                  return -EPROTO;
> >>>>>>
> >>>>>>          cep->mpa.bytes_rcvd += rcvd;
> >>>>>>
> >>>>>> -Krishna.
> >>>>>>>
> >>>>>>> Tom.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Bernard.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> So, for the MPA request case, this code is needed
> >>>>>>>>>> to check for protocol correctness.
> >>>>>>>>>> You are right for the MPA reply case - if we are
> >>>>>>>>>> _not_ in peer2peer mode, the peer can immediately
> >>>>>>>>>> start sending data in RDMA mode after its MPA Reply.
> >>>>>>>>>> So we shall add appropriate code to be more specific
> >>>>>>>>>> For an error, we are (1) processing an MPA Request,
> >>>>>>>>>> OR (2) processing an MPA Reply AND we are not expected
> >>>>>>>>>> to send an initial READ/WRITE/Send as negotiated with
> >>>>>>>>>> the peer (peer2peer mode MPA handshake).
> >>>>>>>>>>
> >>>>>>>>>> Just removing this check would make siw more permissive,
> >>>>>>>>>> but to a point where peer MPA protocol errors are
> >>>>>>>>>> tolerated. I am not in favor of that level of
> >>>>>>>>>> forgiveness.
> >>>>>>>>>>
> >>>>>>>>>> If possible, please provide an appropriate patch
> >>>>>>>>>> or (if it causes current issues with another peer
> >>>>>>>>>> iWarp implementation) just run in MPA peer2peer mode,
> >>>>>>>>>> where the current check is appropriate.
> >>>>>>>>>> Otherwise, I would provide an appropriate fix by Monday
> >>>>>>>>>> (I am still out of office this week).
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Many thanks and best regards,
> >>>>>>>>>> Bernard.
> >>>>>>>>>>
> >>
> >>
> > 
> > 
> >
Bernard Metzler Aug. 12, 2019, 12:56 p.m. UTC | #16
-----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----

>To: "Bernard Metzler" <bmt@zurich.ibm.com>
>From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>Date: 08/12/2019 11:59AM
>Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca" <jgg@ziepe.ca>,
>"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>, "Potnuri
>Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
><nirranjan@chelsio.com>
>Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
>to read beyond MPA message
>
>On Saturday, August 08/10/19, 2019 at 02:05:00 +0530, Tom Talpey
>wrote:
>> On 8/9/2019 9:52 AM, Bernard Metzler wrote:
>> > -----"Tom Talpey" <tom@talpey.com> wrote: -----
>> > 
>> >> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>Eraparaju"
>> >> <krishna2@chelsio.com>
>> >> From: "Tom Talpey" <tom@talpey.com>
>> >> Date: 08/09/2019 02:27PM
>> >> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
>> >> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>> >> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
><nirranjan@chelsio.com>
>> >> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>tries
>> >> to read beyond MPA message
>> >>
>> >> On 8/9/2019 6:29 AM, Bernard Metzler wrote:
>> >>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>-----
>> >>>
>> >>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>
>> >>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> >>>> Date: 08/08/2019 06:47PM
>> >>>> Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca"
><jgg@ziepe.ca>,
>> >>>> "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
>> >> "Potnuri
>> >>>> Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>> >>>> <nirranjan@chelsio.com>
>> >>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
>> >> tries
>> >>>> to read beyond MPA message
>> >>>>
>> >>>> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard
>> >> Metzler
>> >>>> wrote:
>> >>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
>-----
>> >>>>>
>> >>>>>> To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
>> >>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> >>>>>> Date: 08/05/2019 07:26PM
>> >>>>>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>,
>"linux-rdma@vger.kernel.org"
>> >>>>>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
>> >>>>>> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
>> >>>> <nirranjan@chelsio.com>
>> >>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply
>handler
>> >>>> tries
>> >>>>>> to read beyond MPA message
>> >>>>>>
>> >>>>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom
>Talpey
>> >>>> wrote:
>> >>>>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
>> >>>>>>>> -----"Tom Talpey" <tom@talpey.com> wrote: -----
>> >>>>>>>>
>> >>>>>>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
>> >>>>>> Eraparaju"
>> >>>>>>>>> <krishna2@chelsio.com>
>> >>>>>>>>> From: "Tom Talpey" <tom@talpey.com>
>> >>>>>>>>> Date: 08/01/2019 08:54PM
>> >>>>>>>>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
>> >>>>>> bharat@chelsio.com,
>> >>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com
>> >>>>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply
>handler
>> >>>>>> tries
>> >>>>>>>>> to read beyond MPA message
>> >>>>>>>>>
>> >>>>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
>> >>>>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com>
>wrote:
>> >>>>>> -----
>> >>>>>>>>>>
>> >>>>>>>>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
>> >>>>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
>> >>>>>>>>>>> Date: 07/31/2019 12:34PM
>> >>>>>>>>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
>> >>>>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com,
>"Krishnamraju
>> >>>>>>>>> Eraparaju"
>> >>>>>>>>>>> <krishna2@chelsio.com>
>> >>>>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply
>handler
>> >>>>>> tries to
>> >>>>>>>>>>> read beyond MPA message
>> >>>>>>>>>>>
>> >>>>>>>>>>> while processing MPA Reply, SIW driver is trying to
>read
>> >>>> extra
>> >>>>>> 4
>> >>>>>>>>>>> bytes
>> >>>>>>>>>>> than what peer has advertised as private data length.
>> >>>>>>>>>>>
>> >>>>>>>>>>> If a FPDU data is received before even
>siw_recv_mpa_rr()
>> >>>>>> completed
>> >>>>>>>>>>> reading MPA reply, then ksock_recv() in
>siw_recv_mpa_rr()
>> >>>>>> could
>> >>>>>>>>> also
>> >>>>>>>>>>> read FPDU, if "size" is larger than advertised MPA
>reply
>> >>>>>> length.
>> >>>>>>>>>>>
>> >>>>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
>> >>>>>>>>>>> 502 {
>> >>>>>>>>>>>              .............
>> >>>>>>>>>>> 572
>> >>>>>>>>>>> 573         if (rcvd > to_rcv)
>> >>>>>>>>>>> 574                 return -EPROTO;   <----- Failure
>here
>> >>>>>>>>>>>
>> >>>>>>>>>>> Looks like the intention here is to throw an ERROR if
>the
>> >>>>>> received
>> >>>>>>>>>>> data
>> >>>>>>>>>>> is more than the total private data length advertised
>by
>> >>>> the
>> >>>>>> peer.
>> >>>>>>>>>>> But
>> >>>>>>>>>>> reading beyond MPA message causes siw_cm to generate
>> >>>>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv
>> >>>> buffer
>> >>>>>> is
>> >>>>>>>>>>> already
>> >>>>>>>>>>> queued with FPDU messages.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hence, this function should only read upto private data
>> >>>>>> length.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Signed-off-by: Krishnamraju Eraparaju
>> >>>> <krishna2@chelsio.com>
>> >>>>>>>>>>> ---
>> >>>>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
>> >>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> >>>>>>>>>>>
>> >>>>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644
>> >>>>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
>> >>>>>> siw_cep
>> >>>>>>>>> *cep)
>> >>>>>>>>>>> 	 * A private data buffer gets allocated if
>> >>>> hdr->params.pd_len
>> >>>>>> !=
>> >>>>>>>>> 0.
>> >>>>>>>>>>> 	 */
>> >>>>>>>>>>> 	if (!cep->mpa.pdata) {
>> >>>>>>>>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
>> >>>>>>>>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
>> >>>>>>>>>>> 		if (!cep->mpa.pdata)
>> >>>>>>>>>>> 			return -ENOMEM;
>> >>>>>>>>>>> 	}
>> >>>>>>>>>>> 	rcvd = ksock_recv(
>> >>>>>>>>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>sizeof(struct
>> >>>>>> mpa_rr),
>> >>>>>>>>>>> -		to_rcv + 4, MSG_DONTWAIT);
>> >>>>>>>>>>> +		to_rcv, MSG_DONTWAIT);
>> >>>>>>>>>>>
>> >>>>>>>>>>> 	if (rcvd < 0)
>> >>>>>>>>>>> 		return rcvd;
>> >>>>>>>>>>> -- 
>> >>>>>>>>>>> 2.23.0.rc0
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> The intention of this code is to make sure the
>> >>>>>>>>>> peer does not violates the MPA handshake rules.
>> >>>>>>>>>> The initiator MUST NOT send extra data after its
>> >>>>>>>>>> MPA request and before receiving the MPA reply.
>> >>>>>>>>>
>> >>>>>>>>> I think this is true only for MPA v2. With MPA v1, the
>> >>>>>>>>> initiator can begin sending immediately (before receiving
>> >>>>>>>>> the MPA reply), because there is no actual negotiation at
>> >>>>>>>>> the MPA layer.
>> >>>>>>>>>
>> >>>>>>>>> With MPA v2, the negotiation exchange is required because
>> >>>>>>>>> the type of the following message is predicated by the
>> >>>>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32
>bits
>> >>>>>>>>> of the private data buffer.
>> >>>>>>>>>
>> >>>>>>>>> So, it seems to me that additional logic is needed here
>to
>> >>>>>>>>> determine the MPA version, before sniffing additional
>octets
>> >>>>>>>> >from the incoming stream?
>> >>>>>>>>>
>> >>>>>>>>> Tom.
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> There still is the marker negotiation taking place.
>> >>>>>>>> RFC 5044 says in section 7.1.2:
>> >>>>>>>>
>> >>>>>>>> "Note: Since the receiver's ability to deal with Markers
>is
>> >>>>>>>>     unknown until the Request and Reply Frames have been
>> >>>>>>>>     received, sending FPDUs before this occurs is not
>> >> possible."
>> >>>>>>>>
>> >>>>>>>> This section further discusses the responder's behavior,
>> >>>>>>>> where it MUST receive a first FPDU from the initiator
>> >>>>>>>> before sending its first FPDU:
>> >>>>>>>>
>> >>>>>>>> "4.  MPA Responder mode implementations MUST receive and
>> >>>> validate
>> >>>>>> at
>> >>>>>>>>           least one FPDU before sending any FPDUs or
>Markers."
>> >>>>>>>>
>> >>>>>>>> So it appears with MPA version 1, the current siw
>> >>>>>>>> code is correct. The initiator is entering FPDU mode
>> >>>>>>>> first, and only after receiving the MPA reply frame.
>> >>>>>>>> Only after the initiator sent a first FPDU, the responder
>> >>>>>>>> can start using the connection in FPDU mode.
>> >>>>>>>> Because of this somehow broken connection establishment
>> >>>>>>>> procedure (only the initiator can start sending data), a
>> >>>>>>>> later MPA version makes this first FPDU exchange explicit
>> >>>>>>>> and selectable (zero length READ/WRITE/Send).
>> >>>>>>>
>> >>>>>>> Yeah, I guess so. Because nobody ever actually implemented
>> >>>> markers,
>> >>>>>>> I think that they may more or less passively ignore this.
>But
>> >>>>>> you're
>> >>>>>>> currect that it's invalid protocol behavior.
>> >>>>>>>
>> >>>>>>> If your testing didn't uncover any issues with existing
>> >>>>>> implementations
>> >>>>>>> failing to connect with your strict checking, I'm ok with
>it.
>> >>>>>> Tom & Bernard,
>> >>>>>> Thanks for the insight on MPA negotiation.
>> >>>>>>
>> >>>>>> Could the below patch be considered as a proper fix?
>> >>>>>>
>> >>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>> index 9ce8a1b925d2..0aec1b5212f9 100644
>> >>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>> >>>>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct
>siw_cep
>> >>>> *cep)
>> >>>>>>          struct socket *s = cep->sock;
>> >>>>>>          u16 pd_len;
>> >>>>>>          int rcvd, to_rcv;
>> >>>>>> +       int extra_data_check = 4; /* 4Bytes, for MPA rules
>> >>>> violation
>> >>>>>> checking */
>> >>>>>>
>> >>>>>>          if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
>> >>>>>>                  rcvd = ksock_recv(s, (char *)hdr +
>> >>>>>> cep->mpa.bytes_rcvd,
>> >>>>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct
>siw_cep
>> >>>> *cep)
>> >>>>>>                  return -EPROTO;
>> >>>>>>          }
>> >>>>>>
>> >>>>>> +       /*
>> >>>>>> +        * Peer must not send any extra data other than MPA
>> >>>> messages
>> >>>>>> until MPA
>> >>>>>> +        * negotiation is completed, an exception is MPA V2
>> >>>>>> client-server Mode,
>> >>>>>> +        * IE, in this mode the peer after sending MPA Reply
>can
>> >>>>>> immediately
>> >>>>>> +        * start sending data in RDMA mode.
>> >>>>>> +        */
>> >>>>>
>> >>>>> This is unfortunately not true. The responder NEVER sends
>> >>>>> an FPDU without having seen an FPDU from the initiator.
>> >>>>> I just checked RFC 6581 again. The RTR (ready-to-receive)
>> >>>>> indication from the initiator is still needed, but now
>> >>>>> provided by the protocol and not the application: w/o
>> >>>>> 'enhanced MPA setup', the initiator sends the first
>> >>>>> FPDU as an application message. With 'enhanced MPA setup',
>> >>>>> the initiator protocol entity sends (w/o application
>> >>>>> interaction) a zero length READ/WRITE/Send as a first FPDU,
>> >>>>> as previously negotiated with the responder. Again: only
>> >>>>> after the responder has seen the first FPDU, it can
>> >>>>> start sending in FPDU mode.
>> >>>> If I understand your statements correctly: in MPA v2
>clint-server
>> >>>> mode,
>> >>>> the responder application should have some logic to wait(after
>> >>>> ESTABLISH
>> >>>> event) until the first FPDU message is received from the
>> >> initiator?
>> >>>>
>> >>>> Thanks,
>> >>>> Krishna.
>> >>>
>> >>> The responder may either delay the established event
>> >>> until it sees the first peer FPDU, or it delays sending
>> >>> the first (now already posted) FPDU, until it sees the
>> >>> first initiator FPDU. This, since (theoretically), the
>> >>> responder does not know yet the result of CRC and
>> >>> Marker negotiation. siw delays the established event,
>> >>> which seems the more clear path to me.
>> >>>
>> >>> This MPA handshake is cumbersome. siw is permissive
>> >>> for violations of these rules to the extend, that the
>> >>> responder shall not send back to back the first FPDU
>> >>> with the MPA reply. If a (Chelsio ?) iWarp adapter
>> >>> would send the first responder FPDU only after it
>> >>> got TCP acked its MPA reply, the chances are high the
>> >>> (siw) iWarp initiator has already transitioned to RTS
>> >>> and it would accept that FPU, even if not having sent
>> >>> it's first FPDU.
>> >>>
>> >>> siw sticks to this rule to that extend, since all receive
>> >>> processing is triggered by socket callbacks (from
>> >>> sk->sk_data_ready()). As long as the QP is not
>> >>> in RTS, all TCP bytestream processing is done by the
>> >>> MPA protocol (code in siw_cm.c). If the QP reaches
>> >>> RTS, RX processing is done bt the full iWarp
>> >>> protocol (code in siw_qp_rx.c for RX).
>> >>> Now, if the higher layer (application) protocol has
>> >>> a semantic where the responder sends the first message,
>> >>> we could end up in an infinite wait for that packet
>> >>> at initiator application side.
>> >>> This, since the complete first FPDU was already
>> >>> received at the TCP socket, while the QP was not in
>> >>> RTS. It will not trigger any additional sk->sk_data_ready()
>> >>> and we are stuck with a FPDU in the socket rx buffer.
>> >>>
>> >>> I implemented that 4 bytes extra data testing only to
>> >>> keep the siw protocol state machine as simple as
>> >>> possible (it is already unfortunately complex, if
>> >>> you count the lines in siw_cm.c), while following
>> >>> the protocol rules.
>> >>>
>> >>> I suggest to correctly implement peer2peer mode and
>> >>> delay the established event at the responder side until
>> >>> it got the negotiated first zero length FPDU. Out
>> >>> of the possible options, siw supports both zero length
>> >>> WRITE and READ, but no SEND, since this would consume
>> >>> an extra receive WQE.
>> >>>
>> >>> As a last resort, I might consider extending the
>> >>> siw state machine to handle those - non-complaint -
>> >>> cases gracefully. But, that would result in different
>> >>> code than just avoiding checking for peer protocol
>> >>> violation.
>> >>
>> >> Bernard, everything you describe in siw seems perfectly valid
>> >> to me, and in fact that's why the MPA enhanced connection mode
>> >> was designed the way it is. I disagree that it's "cumbersome",
>> >> but that's a nit.
>> >>
>> >> The issue is that the responder reaches RTS at a different time
>> >> than the initiator reaches RTR. The original iWARP connection
>> >> model did not make any requirement, and races were possible.
>> >> MPAv2 fixes those.
>> >>
>> > Tom, thanks, exactly. MPAv2 fixes it. And an implementation
>
>Thanks Bernard & Tom,
>
>Could you also please consider making MPA ehanced connection
>mode(with RTR handshake) as default for SIW, as MPA V2 peer-
>to-peer mode seems to be more promising that MPA V2
>client-server Mode, wrt race conditions.
>
>Currently SIW has MPA V2 client-server Mode as default.
>
>
>In siw_main.c:
>         const bool peer_to_peer = 1;
>
>
>Between, as per my observations, the chances of hitting this 'connect
>error' issue is higher with:   SIW(initator)<--->(responder)SIW
>than:    SIW(initator) <---> (responder)hard-iwarp

Good to know! Yes, siw is permissive at responder side, it's
applications responsibility to delay the first message, if p2p
mode is not selected.
What application are you using here (which let's
the responder immediately start using the SQ)? Just to let
me re-create the case w/o effort.
I might produce a fix for siw which can handle that
case for the non-p2p-mode case.

Setting p2p mode as default is another option. I am not
immediately happy with that suggestion, since it adds
another wire transfer as default. The future shall see
those parameters settable. These were module parameters,
but we abandoned that. We should probably come up with
an extension to the RDMA netlink interface, which
makes those protocol specific modes and flavors settable
per device...?

Thanks
Bernard.
>
>
>> > of that fix MUST adhere to the defined rules. If one
>> > negotiates an extra handshake to synchronize, it
>> > MUST adhere to that handshake rules. There is no point in
>> > negotiating an extra handshake, and right away ignoring it.
>> > If the responder wants a zero length WRITE, it MUST use
>> > its reception to synchronize its transition to RTS.
>> > 
>> > Shall we extend the siw state machine to support silly
>> > peer behavior? As said, if I read there is no way of
>> 
>> Of course the right answer to this is "no", but I'd reserve
>> drawing that conclusion until I am certain it won't cause
>> trouble. If existing iWARP implementations don't enforce
>> this, and SIW becomes the odd duck failing connections, then
>> I'm afraid I would have to say "yes" instead.
>> 
>> > making the peer implementation RFC compliant (it would
>> > be unexpected), I'll look into the applicability of
>> > Postel's Rule. But let's make those things explicit.
>> 
>> Explicit is best, agreed.
>> 
>> Tom.
>> 
>> > 
>> >> SIW is certainly within its rights to prevent peers from
>> >> sending prior to responder RTR. I'd suggest that if possible,
>> >> you try to detect the deadlock rather than flatly rejecting any
>> >> incoming bytes. The Internet Principle (be liberal in what you
>> >> accept...), after all.
>> >>
>> >> This same dance happens in IB/RoCE, btw. Just via different
>> >> messages.
>> >>
>> > Interesting!
>> > 
>> > Thanks
>> > Bernard.
>> > 
>> >> Tom.
>> >>
>> >>
>> >>>
>> >>> Thanks,
>> >>> Bernard.
>> >>>
>> >>>
>> >>>>>
>> >>>>> Sorry for the confusion. But the current
>> >>>>> siw code appears to be just correct.
>> >>>>>
>> >>>>> Thanks
>> >>>>> Bernard
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> +       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
>> >>>>>> MPA_REVISION_2) &&
>> >>>>>> +               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
>> >>>>>> +               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord
>&
>> >>>>>> +                               (MPA_V2_RDMA_WRITE_RTR |
>> >>>>>> MPA_V2_RDMA_READ_RTR);
>> >>>>>> +               if (!mpa_p2p_mode)
>> >>>>>> +                       extra_data_check = 0;
>> >>>>>> +       }
>> >>>>>> +
>> >>>>>>          /*
>> >>>>>>           * At this point, we must have hdr->params.pd_len
>!= 0.
>> >>>>>>           * A private data buffer gets allocated if
>> >>>> hdr->params.pd_len
>> >>>>>> !=
>> >>>>>>           * 0.
>> >>>>>>           */
>> >>>>>>          if (!cep->mpa.pdata) {
>> >>>>>> -               cep->mpa.pdata = kmalloc(pd_len + 4,
>> >> GFP_KERNEL);
>> >>>>>> +               cep->mpa.pdata = kmalloc(pd_len +
>> >>>> extra_data_check,
>> >>>>>> GFP_KERNEL);
>> >>>>>>                  if (!cep->mpa.pdata)
>> >>>>>>                          return -ENOMEM;
>> >>>>>>          }
>> >>>>>>          rcvd = ksock_recv(
>> >>>>>>                  s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
>> >>>>>> sizeof(struct
>> >>>>>> mpa_rr),
>> >>>>>> -               to_rcv + 4, MSG_DONTWAIT);
>> >>>>>> +               to_rcv + extra_data_check, MSG_DONTWAIT);
>> >>>>>>
>> >>>>>>          if (rcvd < 0)
>> >>>>>>                  return rcvd;
>> >>>>>>
>> >>>>>> -       if (rcvd > to_rcv)
>> >>>>>> +       if (extra_data_check && (rcvd > to_rcv))
>> >>>>>>                  return -EPROTO;
>> >>>>>>
>> >>>>>>          cep->mpa.bytes_rcvd += rcvd;
>> >>>>>>
>> >>>>>> -Krishna.
>> >>>>>>>
>> >>>>>>> Tom.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> Bernard.
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>> So, for the MPA request case, this code is needed
>> >>>>>>>>>> to check for protocol correctness.
>> >>>>>>>>>> You are right for the MPA reply case - if we are
>> >>>>>>>>>> _not_ in peer2peer mode, the peer can immediately
>> >>>>>>>>>> start sending data in RDMA mode after its MPA Reply.
>> >>>>>>>>>> So we shall add appropriate code to be more specific
>> >>>>>>>>>> For an error, we are (1) processing an MPA Request,
>> >>>>>>>>>> OR (2) processing an MPA Reply AND we are not expected
>> >>>>>>>>>> to send an initial READ/WRITE/Send as negotiated with
>> >>>>>>>>>> the peer (peer2peer mode MPA handshake).
>> >>>>>>>>>>
>> >>>>>>>>>> Just removing this check would make siw more permissive,
>> >>>>>>>>>> but to a point where peer MPA protocol errors are
>> >>>>>>>>>> tolerated. I am not in favor of that level of
>> >>>>>>>>>> forgiveness.
>> >>>>>>>>>>
>> >>>>>>>>>> If possible, please provide an appropriate patch
>> >>>>>>>>>> or (if it causes current issues with another peer
>> >>>>>>>>>> iWarp implementation) just run in MPA peer2peer mode,
>> >>>>>>>>>> where the current check is appropriate.
>> >>>>>>>>>> Otherwise, I would provide an appropriate fix by Monday
>> >>>>>>>>>> (I am still out of office this week).
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Many thanks and best regards,
>> >>>>>>>>>> Bernard.
>> >>>>>>>>>>
>> >>
>> >>
>> > 
>> > 
>> > 
>
>
Krishnamraju Eraparaju Aug. 13, 2019, 8:05 a.m. UTC | #17
On Monday, August 08/12/19, 2019 at 12:56:18 +0000, Bernard Metzler wrote:
> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote: -----
> 
> >To: "Bernard Metzler" <bmt@zurich.ibm.com>
> >From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >Date: 08/12/2019 11:59AM
> >Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca" <jgg@ziepe.ca>,
> >"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>, "Potnuri
> >Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
> ><nirranjan@chelsio.com>
> >Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries
> >to read beyond MPA message
> >
> >On Saturday, August 08/10/19, 2019 at 02:05:00 +0530, Tom Talpey
> >wrote:
> >> On 8/9/2019 9:52 AM, Bernard Metzler wrote:
> >> > -----"Tom Talpey" <tom@talpey.com> wrote: -----
> >> > 
> >> >> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
> >Eraparaju"
> >> >> <krishna2@chelsio.com>
> >> >> From: "Tom Talpey" <tom@talpey.com>
> >> >> Date: 08/09/2019 02:27PM
> >> >> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>, "linux-rdma@vger.kernel.org"
> >> >> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
> >> >> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
> ><nirranjan@chelsio.com>
> >> >> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
> >tries
> >> >> to read beyond MPA message
> >> >>
> >> >> On 8/9/2019 6:29 AM, Bernard Metzler wrote:
> >> >>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
> >-----
> >> >>>
> >> >>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>
> >> >>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >> >>>> Date: 08/08/2019 06:47PM
> >> >>>> Cc: "Tom Talpey" <tom@talpey.com>, "jgg@ziepe.ca"
> ><jgg@ziepe.ca>,
> >> >>>> "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
> >> >> "Potnuri
> >> >>>> Bharat Teja" <bharat@chelsio.com>, "Nirranjan Kirubaharan"
> >> >>>> <nirranjan@chelsio.com>
> >> >>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler
> >> >> tries
> >> >>>> to read beyond MPA message
> >> >>>>
> >> >>>> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard
> >> >> Metzler
> >> >>>> wrote:
> >> >>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com> wrote:
> >-----
> >> >>>>>
> >> >>>>>> To: "Tom Talpey" <tom@talpey.com>, <BMT@zurich.ibm.com>
> >> >>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >> >>>>>> Date: 08/05/2019 07:26PM
> >> >>>>>> Cc: "jgg@ziepe.ca" <jgg@ziepe.ca>,
> >"linux-rdma@vger.kernel.org"
> >> >>>>>> <linux-rdma@vger.kernel.org>, "Potnuri Bharat Teja"
> >> >>>>>> <bharat@chelsio.com>, "Nirranjan Kirubaharan"
> >> >>>> <nirranjan@chelsio.com>
> >> >>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply
> >handler
> >> >>>> tries
> >> >>>>>> to read beyond MPA message
> >> >>>>>>
> >> >>>>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom
> >Talpey
> >> >>>> wrote:
> >> >>>>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote:
> >> >>>>>>>> -----"Tom Talpey" <tom@talpey.com> wrote: -----
> >> >>>>>>>>
> >> >>>>>>>>> To: "Bernard Metzler" <BMT@zurich.ibm.com>, "Krishnamraju
> >> >>>>>> Eraparaju"
> >> >>>>>>>>> <krishna2@chelsio.com>
> >> >>>>>>>>> From: "Tom Talpey" <tom@talpey.com>
> >> >>>>>>>>> Date: 08/01/2019 08:54PM
> >> >>>>>>>>> Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
> >> >>>>>> bharat@chelsio.com,
> >> >>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com
> >> >>>>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply
> >handler
> >> >>>>>> tries
> >> >>>>>>>>> to read beyond MPA message
> >> >>>>>>>>>
> >> >>>>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote:
> >> >>>>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >wrote:
> >> >>>>>> -----
> >> >>>>>>>>>>
> >> >>>>>>>>>>> To: jgg@ziepe.ca, bmt@zurich.ibm.com
> >> >>>>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@chelsio.com>
> >> >>>>>>>>>>> Date: 07/31/2019 12:34PM
> >> >>>>>>>>>>> Cc: linux-rdma@vger.kernel.org, bharat@chelsio.com,
> >> >>>>>>>>>>> nirranjan@chelsio.com, krishn2@chelsio.com,
> >"Krishnamraju
> >> >>>>>>>>> Eraparaju"
> >> >>>>>>>>>>> <krishna2@chelsio.com>
> >> >>>>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply
> >handler
> >> >>>>>> tries to
> >> >>>>>>>>>>> read beyond MPA message
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> while processing MPA Reply, SIW driver is trying to
> >read
> >> >>>> extra
> >> >>>>>> 4
> >> >>>>>>>>>>> bytes
> >> >>>>>>>>>>> than what peer has advertised as private data length.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> If a FPDU data is received before even
> >siw_recv_mpa_rr()
> >> >>>>>> completed
> >> >>>>>>>>>>> reading MPA reply, then ksock_recv() in
> >siw_recv_mpa_rr()
> >> >>>>>> could
> >> >>>>>>>>> also
> >> >>>>>>>>>>> read FPDU, if "size" is larger than advertised MPA
> >reply
> >> >>>>>> length.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep)
> >> >>>>>>>>>>> 502 {
> >> >>>>>>>>>>>              .............
> >> >>>>>>>>>>> 572
> >> >>>>>>>>>>> 573         if (rcvd > to_rcv)
> >> >>>>>>>>>>> 574                 return -EPROTO;   <----- Failure
> >here
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Looks like the intention here is to throw an ERROR if
> >the
> >> >>>>>> received
> >> >>>>>>>>>>> data
> >> >>>>>>>>>>> is more than the total private data length advertised
> >by
> >> >>>> the
> >> >>>>>> peer.
> >> >>>>>>>>>>> But
> >> >>>>>>>>>>> reading beyond MPA message causes siw_cm to generate
> >> >>>>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv
> >> >>>> buffer
> >> >>>>>> is
> >> >>>>>>>>>>> already
> >> >>>>>>>>>>> queued with FPDU messages.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Hence, this function should only read upto private data
> >> >>>>>> length.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Signed-off-by: Krishnamraju Eraparaju
> >> >>>> <krishna2@chelsio.com>
> >> >>>>>>>>>>> ---
> >> >>>>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++--
> >> >>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644
> >> >>>>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct
> >> >>>>>> siw_cep
> >> >>>>>>>>> *cep)
> >> >>>>>>>>>>> 	 * A private data buffer gets allocated if
> >> >>>> hdr->params.pd_len
> >> >>>>>> !=
> >> >>>>>>>>> 0.
> >> >>>>>>>>>>> 	 */
> >> >>>>>>>>>>> 	if (!cep->mpa.pdata) {
> >> >>>>>>>>>>> -		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
> >> >>>>>>>>>>> +		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
> >> >>>>>>>>>>> 		if (!cep->mpa.pdata)
> >> >>>>>>>>>>> 			return -ENOMEM;
> >> >>>>>>>>>>> 	}
> >> >>>>>>>>>>> 	rcvd = ksock_recv(
> >> >>>>>>>>>>> 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
> >sizeof(struct
> >> >>>>>> mpa_rr),
> >> >>>>>>>>>>> -		to_rcv + 4, MSG_DONTWAIT);
> >> >>>>>>>>>>> +		to_rcv, MSG_DONTWAIT);
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 	if (rcvd < 0)
> >> >>>>>>>>>>> 		return rcvd;
> >> >>>>>>>>>>> -- 
> >> >>>>>>>>>>> 2.23.0.rc0
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> The intention of this code is to make sure the
> >> >>>>>>>>>> peer does not violates the MPA handshake rules.
> >> >>>>>>>>>> The initiator MUST NOT send extra data after its
> >> >>>>>>>>>> MPA request and before receiving the MPA reply.
> >> >>>>>>>>>
> >> >>>>>>>>> I think this is true only for MPA v2. With MPA v1, the
> >> >>>>>>>>> initiator can begin sending immediately (before receiving
> >> >>>>>>>>> the MPA reply), because there is no actual negotiation at
> >> >>>>>>>>> the MPA layer.
> >> >>>>>>>>>
> >> >>>>>>>>> With MPA v2, the negotiation exchange is required because
> >> >>>>>>>>> the type of the following message is predicated by the
> >> >>>>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32
> >bits
> >> >>>>>>>>> of the private data buffer.
> >> >>>>>>>>>
> >> >>>>>>>>> So, it seems to me that additional logic is needed here
> >to
> >> >>>>>>>>> determine the MPA version, before sniffing additional
> >octets
> >> >>>>>>>> >from the incoming stream?
> >> >>>>>>>>>
> >> >>>>>>>>> Tom.
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> There still is the marker negotiation taking place.
> >> >>>>>>>> RFC 5044 says in section 7.1.2:
> >> >>>>>>>>
> >> >>>>>>>> "Note: Since the receiver's ability to deal with Markers
> >is
> >> >>>>>>>>     unknown until the Request and Reply Frames have been
> >> >>>>>>>>     received, sending FPDUs before this occurs is not
> >> >> possible."
> >> >>>>>>>>
> >> >>>>>>>> This section further discusses the responder's behavior,
> >> >>>>>>>> where it MUST receive a first FPDU from the initiator
> >> >>>>>>>> before sending its first FPDU:
> >> >>>>>>>>
> >> >>>>>>>> "4.  MPA Responder mode implementations MUST receive and
> >> >>>> validate
> >> >>>>>> at
> >> >>>>>>>>           least one FPDU before sending any FPDUs or
> >Markers."
> >> >>>>>>>>
> >> >>>>>>>> So it appears with MPA version 1, the current siw
> >> >>>>>>>> code is correct. The initiator is entering FPDU mode
> >> >>>>>>>> first, and only after receiving the MPA reply frame.
> >> >>>>>>>> Only after the initiator sent a first FPDU, the responder
> >> >>>>>>>> can start using the connection in FPDU mode.
> >> >>>>>>>> Because of this somehow broken connection establishment
> >> >>>>>>>> procedure (only the initiator can start sending data), a
> >> >>>>>>>> later MPA version makes this first FPDU exchange explicit
> >> >>>>>>>> and selectable (zero length READ/WRITE/Send).
> >> >>>>>>>
> >> >>>>>>> Yeah, I guess so. Because nobody ever actually implemented
> >> >>>> markers,
> >> >>>>>>> I think that they may more or less passively ignore this.
> >But
> >> >>>>>> you're
> >> >>>>>>> currect that it's invalid protocol behavior.
> >> >>>>>>>
> >> >>>>>>> If your testing didn't uncover any issues with existing
> >> >>>>>> implementations
> >> >>>>>>> failing to connect with your strict checking, I'm ok with
> >it.
> >> >>>>>> Tom & Bernard,
> >> >>>>>> Thanks for the insight on MPA negotiation.
> >> >>>>>>
> >> >>>>>> Could the below patch be considered as a proper fix?
> >> >>>>>>
> >> >>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>> b/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>> index 9ce8a1b925d2..0aec1b5212f9 100644
> >> >>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> >> >>>>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct
> >siw_cep
> >> >>>> *cep)
> >> >>>>>>          struct socket *s = cep->sock;
> >> >>>>>>          u16 pd_len;
> >> >>>>>>          int rcvd, to_rcv;
> >> >>>>>> +       int extra_data_check = 4; /* 4Bytes, for MPA rules
> >> >>>> violation
> >> >>>>>> checking */
> >> >>>>>>
> >> >>>>>>          if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
> >> >>>>>>                  rcvd = ksock_recv(s, (char *)hdr +
> >> >>>>>> cep->mpa.bytes_rcvd,
> >> >>>>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct
> >siw_cep
> >> >>>> *cep)
> >> >>>>>>                  return -EPROTO;
> >> >>>>>>          }
> >> >>>>>>
> >> >>>>>> +       /*
> >> >>>>>> +        * Peer must not send any extra data other than MPA
> >> >>>> messages
> >> >>>>>> until MPA
> >> >>>>>> +        * negotiation is completed, an exception is MPA V2
> >> >>>>>> client-server Mode,
> >> >>>>>> +        * IE, in this mode the peer after sending MPA Reply
> >can
> >> >>>>>> immediately
> >> >>>>>> +        * start sending data in RDMA mode.
> >> >>>>>> +        */
> >> >>>>>
> >> >>>>> This is unfortunately not true. The responder NEVER sends
> >> >>>>> an FPDU without having seen an FPDU from the initiator.
> >> >>>>> I just checked RFC 6581 again. The RTR (ready-to-receive)
> >> >>>>> indication from the initiator is still needed, but now
> >> >>>>> provided by the protocol and not the application: w/o
> >> >>>>> 'enhanced MPA setup', the initiator sends the first
> >> >>>>> FPDU as an application message. With 'enhanced MPA setup',
> >> >>>>> the initiator protocol entity sends (w/o application
> >> >>>>> interaction) a zero length READ/WRITE/Send as a first FPDU,
> >> >>>>> as previously negotiated with the responder. Again: only
> >> >>>>> after the responder has seen the first FPDU, it can
> >> >>>>> start sending in FPDU mode.
> >> >>>> If I understand your statements correctly: in MPA v2
> >clint-server
> >> >>>> mode,
> >> >>>> the responder application should have some logic to wait(after
> >> >>>> ESTABLISH
> >> >>>> event) until the first FPDU message is received from the
> >> >> initiator?
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Krishna.
> >> >>>
> >> >>> The responder may either delay the established event
> >> >>> until it sees the first peer FPDU, or it delays sending
> >> >>> the first (now already posted) FPDU, until it sees the
> >> >>> first initiator FPDU. This, since (theoretically), the
> >> >>> responder does not know yet the result of CRC and
> >> >>> Marker negotiation. siw delays the established event,
> >> >>> which seems the more clear path to me.
> >> >>>
> >> >>> This MPA handshake is cumbersome. siw is permissive
> >> >>> for violations of these rules to the extend, that the
> >> >>> responder shall not send back to back the first FPDU
> >> >>> with the MPA reply. If a (Chelsio ?) iWarp adapter
> >> >>> would send the first responder FPDU only after it
> >> >>> got TCP acked its MPA reply, the chances are high the
> >> >>> (siw) iWarp initiator has already transitioned to RTS
> >> >>> and it would accept that FPU, even if not having sent
> >> >>> it's first FPDU.
> >> >>>
> >> >>> siw sticks to this rule to that extend, since all receive
> >> >>> processing is triggered by socket callbacks (from
> >> >>> sk->sk_data_ready()). As long as the QP is not
> >> >>> in RTS, all TCP bytestream processing is done by the
> >> >>> MPA protocol (code in siw_cm.c). If the QP reaches
> >> >>> RTS, RX processing is done bt the full iWarp
> >> >>> protocol (code in siw_qp_rx.c for RX).
> >> >>> Now, if the higher layer (application) protocol has
> >> >>> a semantic where the responder sends the first message,
> >> >>> we could end up in an infinite wait for that packet
> >> >>> at initiator application side.
> >> >>> This, since the complete first FPDU was already
> >> >>> received at the TCP socket, while the QP was not in
> >> >>> RTS. It will not trigger any additional sk->sk_data_ready()
> >> >>> and we are stuck with a FPDU in the socket rx buffer.
> >> >>>
> >> >>> I implemented that 4 bytes extra data testing only to
> >> >>> keep the siw protocol state machine as simple as
> >> >>> possible (it is already unfortunately complex, if
> >> >>> you count the lines in siw_cm.c), while following
> >> >>> the protocol rules.
> >> >>>
> >> >>> I suggest to correctly implement peer2peer mode and
> >> >>> delay the established event at the responder side until
> >> >>> it got the negotiated first zero length FPDU. Out
> >> >>> of the possible options, siw supports both zero length
> >> >>> WRITE and READ, but no SEND, since this would consume
> >> >>> an extra receive WQE.
> >> >>>
> >> >>> As a last resort, I might consider extending the
> >> >>> siw state machine to handle those - non-complaint -
> >> >>> cases gracefully. But, that would result in different
> >> >>> code than just avoiding checking for peer protocol
> >> >>> violation.
> >> >>
> >> >> Bernard, everything you describe in siw seems perfectly valid
> >> >> to me, and in fact that's why the MPA enhanced connection mode
> >> >> was designed the way it is. I disagree that it's "cumbersome",
> >> >> but that's a nit.
> >> >>
> >> >> The issue is that the responder reaches RTS at a different time
> >> >> than the initiator reaches RTR. The original iWARP connection
> >> >> model did not make any requirement, and races were possible.
> >> >> MPAv2 fixes those.
> >> >>
> >> > Tom, thanks, exactly. MPAv2 fixes it. And an implementation
> >
> >Thanks Bernard & Tom,
> >
> >Could you also please consider making MPA ehanced connection
> >mode(with RTR handshake) as default for SIW, as MPA V2 peer-
> >to-peer mode seems to be more promising that MPA V2
> >client-server Mode, wrt race conditions.
> >
> >Currently SIW has MPA V2 client-server Mode as default.
> >
> >
> >In siw_main.c:
> >         const bool peer_to_peer = 1;
> >
> >
> >Between, as per my observations, the chances of hitting this 'connect
> >error' issue is higher with:   SIW(initator)<--->(responder)SIW
> >than:    SIW(initator) <---> (responder)hard-iwarp
> 
> Good to know! Yes, siw is permissive at responder side, it's
> applications responsibility to delay the first message, if p2p
> mode is not selected.
> What application are you using here (which let's
> the responder immediately start using the SQ)? Just to let
> me re-create the case w/o effort.
> I might produce a fix for siw which can handle that
> case for the non-p2p-mode case.

I originally noticed this issue with a bidirectional bandwidth
measurement tool(that we use internally).

Later I wrote a small program to quickly recreate this issue.
please follow the below steps:

- git clone https://github.com/krishna-e/rdma-example-programs
- cd rdma-example-programs/IBV_WR_RDMA_WRITE_for_SIW_plus_4_issue/

- server side:
	# cc -g server.c  -libverbs -lrdmacm  -o server
	# ./server <any_integer_data>


- client side:
	# echo -n 'module siw +p' >
	# /sys/kernel/debug/dynamic_debug/control
        (just to slow down the client processing by enabling traces)

	# cc -g client.c  -libverbs -lrdmacm  -o client
	# ./client <ip_address>

	output:
	Unexpected event : RDMA_CM_EVENT_CONNECT_ERROR

> 
> Setting p2p mode as default is another option. I am not
> immediately happy with that suggestion, since it adds
> another wire transfer as default. The future shall see
> those parameters settable. These were module parameters,
> but we abandoned that. We should probably come up with
> an extension to the RDMA netlink interface, which
> makes those protocol specific modes and flavors settable
> per device...?
> 
> Thanks
> Bernard.
> >
> >
> >> > of that fix MUST adhere to the defined rules. If one
> >> > negotiates an extra handshake to synchronize, it
> >> > MUST adhere to that handshake rules. There is no point in
> >> > negotiating an extra handshake, and right away ignoring it.
> >> > If the responder wants a zero length WRITE, it MUST use
> >> > its reception to synchronize its transition to RTS.
> >> > 
> >> > Shall we extend the siw state machine to support silly
> >> > peer behavior? As said, if I read there is no way of
> >> 
> >> Of course the right answer to this is "no", but I'd reserve
> >> drawing that conclusion until I am certain it won't cause
> >> trouble. If existing iWARP implementations don't enforce
> >> this, and SIW becomes the odd duck failing connections, then
> >> I'm afraid I would have to say "yes" instead.
> >> 
> >> > making the peer implementation RFC compliant (it would
> >> > be unexpected), I'll look into the applicability of
> >> > Postel's Rule. But let's make those things explicit.
> >> 
> >> Explicit is best, agreed.
> >> 
> >> Tom.
> >> 
> >> > 
> >> >> SIW is certainly within its rights to prevent peers from
> >> >> sending prior to responder RTR. I'd suggest that if possible,
> >> >> you try to detect the deadlock rather than flatly rejecting any
> >> >> incoming bytes. The Internet Principle (be liberal in what you
> >> >> accept...), after all.
> >> >>
> >> >> This same dance happens in IB/RoCE, btw. Just via different
> >> >> messages.
> >> >>
> >> > Interesting!
> >> > 
> >> > Thanks
> >> > Bernard.
> >> > 
> >> >> Tom.
> >> >>
> >> >>
> >> >>>
> >> >>> Thanks,
> >> >>> Bernard.
> >> >>>
> >> >>>
> >> >>>>>
> >> >>>>> Sorry for the confusion. But the current
> >> >>>>> siw code appears to be just correct.
> >> >>>>>
> >> >>>>> Thanks
> >> >>>>> Bernard
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>> +       if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) ==
> >> >>>>>> MPA_REVISION_2) &&
> >> >>>>>> +               (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) {
> >> >>>>>> +               int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord
> >&
> >> >>>>>> +                               (MPA_V2_RDMA_WRITE_RTR |
> >> >>>>>> MPA_V2_RDMA_READ_RTR);
> >> >>>>>> +               if (!mpa_p2p_mode)
> >> >>>>>> +                       extra_data_check = 0;
> >> >>>>>> +       }
> >> >>>>>> +
> >> >>>>>>          /*
> >> >>>>>>           * At this point, we must have hdr->params.pd_len
> >!= 0.
> >> >>>>>>           * A private data buffer gets allocated if
> >> >>>> hdr->params.pd_len
> >> >>>>>> !=
> >> >>>>>>           * 0.
> >> >>>>>>           */
> >> >>>>>>          if (!cep->mpa.pdata) {
> >> >>>>>> -               cep->mpa.pdata = kmalloc(pd_len + 4,
> >> >> GFP_KERNEL);
> >> >>>>>> +               cep->mpa.pdata = kmalloc(pd_len +
> >> >>>> extra_data_check,
> >> >>>>>> GFP_KERNEL);
> >> >>>>>>                  if (!cep->mpa.pdata)
> >> >>>>>>                          return -ENOMEM;
> >> >>>>>>          }
> >> >>>>>>          rcvd = ksock_recv(
> >> >>>>>>                  s, cep->mpa.pdata + cep->mpa.bytes_rcvd -
> >> >>>>>> sizeof(struct
> >> >>>>>> mpa_rr),
> >> >>>>>> -               to_rcv + 4, MSG_DONTWAIT);
> >> >>>>>> +               to_rcv + extra_data_check, MSG_DONTWAIT);
> >> >>>>>>
> >> >>>>>>          if (rcvd < 0)
> >> >>>>>>                  return rcvd;
> >> >>>>>>
> >> >>>>>> -       if (rcvd > to_rcv)
> >> >>>>>> +       if (extra_data_check && (rcvd > to_rcv))
> >> >>>>>>                  return -EPROTO;
> >> >>>>>>
> >> >>>>>>          cep->mpa.bytes_rcvd += rcvd;
> >> >>>>>>
> >> >>>>>> -Krishna.
> >> >>>>>>>
> >> >>>>>>> Tom.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Bernard.
> >> >>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>> So, for the MPA request case, this code is needed
> >> >>>>>>>>>> to check for protocol correctness.
> >> >>>>>>>>>> You are right for the MPA reply case - if we are
> >> >>>>>>>>>> _not_ in peer2peer mode, the peer can immediately
> >> >>>>>>>>>> start sending data in RDMA mode after its MPA Reply.
> >> >>>>>>>>>> So we shall add appropriate code to be more specific
> >> >>>>>>>>>> For an error, we are (1) processing an MPA Request,
> >> >>>>>>>>>> OR (2) processing an MPA Reply AND we are not expected
> >> >>>>>>>>>> to send an initial READ/WRITE/Send as negotiated with
> >> >>>>>>>>>> the peer (peer2peer mode MPA handshake).
> >> >>>>>>>>>>
> >> >>>>>>>>>> Just removing this check would make siw more permissive,
> >> >>>>>>>>>> but to a point where peer MPA protocol errors are
> >> >>>>>>>>>> tolerated. I am not in favor of that level of
> >> >>>>>>>>>> forgiveness.
> >> >>>>>>>>>>
> >> >>>>>>>>>> If possible, please provide an appropriate patch
> >> >>>>>>>>>> or (if it causes current issues with another peer
> >> >>>>>>>>>> iWarp implementation) just run in MPA peer2peer mode,
> >> >>>>>>>>>> where the current check is appropriate.
> >> >>>>>>>>>> Otherwise, I would provide an appropriate fix by Monday
> >> >>>>>>>>>> (I am still out of office this week).
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> Many thanks and best regards,
> >> >>>>>>>>>> Bernard.
> >> >>>>>>>>>>
> >> >>
> >> >>
> >> > 
> >> > 
> >> > 
> >
> >
>
diff mbox series

Patch

diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c
index a7cde98e73e8..8dc8cea2566c 100644
--- a/drivers/infiniband/sw/siw/siw_cm.c
+++ b/drivers/infiniband/sw/siw/siw_cm.c
@@ -559,13 +559,13 @@  static int siw_recv_mpa_rr(struct siw_cep *cep)
 	 * A private data buffer gets allocated if hdr->params.pd_len != 0.
 	 */
 	if (!cep->mpa.pdata) {
-		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
+		cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL);
 		if (!cep->mpa.pdata)
 			return -ENOMEM;
 	}
 	rcvd = ksock_recv(
 		s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct mpa_rr),
-		to_rcv + 4, MSG_DONTWAIT);
+		to_rcv, MSG_DONTWAIT);
 
 	if (rcvd < 0)
 		return rcvd;