
[1/1] IB/mlx5: Add a signature check to received EQEs and CQEs

Message ID: 20221005174521.63619-1-rohit.sajan.kumar@oracle.com (mailing list archive)
State: Changes Requested
Series [1/1] IB/mlx5: Add a signature check to received EQEs and CQEs

Commit Message

Rohit Nair Oct. 5, 2022, 5:45 p.m. UTC
As the PRM defines, the byte-wise XOR of the EQE and the EQE index should
be 0xff. Otherwise, we can assume we have a corrupt EQE. The same applies
to the CQE as well.

Add a check to verify that the EQE or CQE is valid in that respect and,
if not, dump the CQE or EQE to dmesg so it can be inspected.

This patch does not introduce any significant performance degradations
and has been tested using qperf.

Suggested-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Rohit Nair <rohit.sajan.kumar@oracle.com>
---
 drivers/infiniband/hw/mlx5/cq.c              | 40 ++++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 39 +++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

Comments

Leon Romanovsky Oct. 11, 2022, 7:17 a.m. UTC | #1
There is no need to ping anyone; the patch is registered in patchwork:
https://patchwork.kernel.org/project/linux-rdma/patch/20221005174521.63619-1-rohit.sajan.kumar@oracle.com/
and we will get to it.

You sent the patch during the merge window; no wonder that no one looked at it.

On Wed, Oct 05, 2022 at 10:45:20AM -0700, Rohit Nair wrote:
> As the PRM defines, the byte-wise XOR of the EQE and the EQE index should
> be 0xff. Otherwise, we can assume we have a corrupt EQE. The same applies
> to the CQE as well.

I didn't find anything like this in my version of the PRM.

> 
> Add a check to verify that the EQE or CQE is valid in that respect and,
> if not, dump the CQE or EQE to dmesg so it can be inspected.

While it is nice to see prints in dmesg, you need to explain why other
mechanisms (reporters, mlx5 events, etc.) are not enough.

> 
> This patch does not introduce any significant performance degradations
> and has been tested using qperf.

What does it mean? You made changes in the kernel verbs flow; they are not
executed through qperf.

> 
> Suggested-by: Michael Guralnik <michaelgur@nvidia.com>
> Signed-off-by: Rohit Nair <rohit.sajan.kumar@oracle.com>
> ---
>  drivers/infiniband/hw/mlx5/cq.c              | 40 ++++++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 39 +++++++++++++++++++++++++++
>  2 files changed, 79 insertions(+)
> 
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index be189e0..2a6d722 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -441,6 +441,44 @@ static void mlx5_ib_poll_sw_comp(struct mlx5_ib_cq *cq, int num_entries,
>  	}
>  }
>  
> +static void verify_cqe(struct mlx5_cqe64 *cqe64, struct mlx5_ib_cq *cq)
> +{
> +	int i = 0;
> +	u64 temp_xor = 0;
> +	struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
> +
> +	u32 cons_index = cq->mcq.cons_index;
> +	u64 *eight_byte_raw_cqe = (u64 *)cqe64;
> +	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
> +	u8 cqe_bytewise_xor = (cons_index & 0xff) ^
> +				((cons_index & 0xff00) >> 8) ^
> +				((cons_index & 0xff0000) >> 16);
> +
> +	for (i = 0; i < sizeof(struct mlx5_cqe64); i += 8) {
> +		temp_xor ^= *eight_byte_raw_cqe;
> +		eight_byte_raw_cqe++;
> +	}
> +
> +	for (i = 0; i < (sizeof(u64)); i++) {
> +		cqe_bytewise_xor ^= *temp_bytewise_xor;
> +		temp_bytewise_xor++;
> +	}
> +
> +	if (cqe_bytewise_xor == 0xff)
> +		return;
> +
> +	dev_err(&dev->mdev->pdev->dev,
> +		"Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
> +		cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
> +	dev_err(&dev->mdev->pdev->dev,
> +		"cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
> +		cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn, cq->mcq.cqe_sz);

mlx5_err ... and not dev_err ...

> +
> +	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
> +		       16, 1, cqe64, sizeof(*cqe64), false);
> +	BUG();

No BUG() in new code.

Thanks
Rohit Nair Oct. 25, 2022, 5:44 p.m. UTC | #2
Hey Leon,

Please find my replies to your comments below:


On 10/11/22 12:17 AM, Leon Romanovsky wrote:
> There is no need to ping anyone; the patch is registered in patchwork:
> https://patchwork.kernel.org/project/linux-rdma/patch/20221005174521.63619-1-rohit.sajan.kumar@oracle.com/
> and we will get to it.
>
> You sent the patch during the merge window; no wonder that no one looked at it.
>
> On Wed, Oct 05, 2022 at 10:45:20AM -0700, Rohit Nair wrote:
>> As the PRM defines, the byte-wise XOR of the EQE and the EQE index should
>> be 0xff. Otherwise, we can assume we have a corrupt EQE. The same applies
>> to the CQE as well.
> I didn't find anything like this in my version of the PRM.

The PRM does contain this information; it can be seen on page 121 of the
reference manual. I have linked the reference manual I consulted:
https://network.nvidia.com/files/doc-2020/ethernet-adapters-programming-manual.pdf
Here is a short extract from the reference manual: "Byte-wise XOR of CQE
- signature protection (see "Completion and Event Queue Elements (CQEs
and EQEs)" on page 156). CQE is valid if byte-wise XOR of entire CQE
(including signature field) and the CQE index is 0xff. For 128B CQE, the
GRH/inline_64 section is taken into account if data / GRH was written to
it (cqe_format == 2 or grh == 2)"
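
To make that rule concrete, here is a minimal standalone sketch of the
check the PRM describes (an illustration only, not the patch code; the
helper name is made up):

#include <stdint.h>
#include <stddef.h>

/* PRM rule (sketch): a 64B CQE is valid if the byte-wise XOR of the
 * entire CQE (signature byte included) and the low three bytes of the
 * CQE index equals 0xff. */
static int cqe_signature_ok(const uint8_t cqe[64], uint32_t cons_index)
{
	uint8_t x = (cons_index & 0xff) ^
		    ((cons_index >> 8) & 0xff) ^
		    ((cons_index >> 16) & 0xff);
	size_t i;

	for (i = 0; i < 64; i++)
		x ^= cqe[i];

	return x == 0xff;
}

The patch performs the same computation, just folded: it XORs the CQE
eight bytes at a time into a u64 and then XORs the eight lanes of that
u64 together, which is equivalent to the byte loop here since XOR is
associative and commutative.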

>
>> Adding a check to verify the EQE and CQE is valid in that aspect and if
>> not, dump the CQE and EQE to dmesg to be inspected.
> While it is nice to see prints in dmesg, you need to explain why other
> mechanisms (reporters, mlx5 events, etc.) are not enough.

We recently faced an issue where we were failing to arm the CQ due to
an sn mismatch. This issue was not reported in the logs or error events,
and the same was true for the corrupted CQE.
If we had been able to detect the corrupted CQE or EQE earlier, we might
have been able to respond to the issue better and in a more timely
manner. This is our motivation for having the error printed to dmesg.

>
>> This patch does not introduce any significant performance degradations
>> and has been tested using qperf.
> What does it mean? You made changes in the kernel verbs flow; they are not
> executed through qperf.
We also conducted several extensive performance tests using our
test suite, which utilizes rds-stress, and saw no significant
performance degradation in those results.

>> Suggested-by: Michael Guralnik <michaelgur@nvidia.com>
>> Signed-off-by: Rohit Nair <rohit.sajan.kumar@oracle.com>
>> ---
>>   drivers/infiniband/hw/mlx5/cq.c              | 40 ++++++++++++++++++++++++++++
>>   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 39 +++++++++++++++++++++++++++
>>   2 files changed, 79 insertions(+)
>>
>> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
>> index be189e0..2a6d722 100644
>> --- a/drivers/infiniband/hw/mlx5/cq.c
>> +++ b/drivers/infiniband/hw/mlx5/cq.c
>> @@ -441,6 +441,44 @@ static void mlx5_ib_poll_sw_comp(struct mlx5_ib_cq *cq, int num_entries,
>>   	}
>>   }
>>   
>> +static void verify_cqe(struct mlx5_cqe64 *cqe64, struct mlx5_ib_cq *cq)
>> +{
>> +	int i = 0;
>> +	u64 temp_xor = 0;
>> +	struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
>> +
>> +	u32 cons_index = cq->mcq.cons_index;
>> +	u64 *eight_byte_raw_cqe = (u64 *)cqe64;
>> +	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
>> +	u8 cqe_bytewise_xor = (cons_index & 0xff) ^
>> +				((cons_index & 0xff00) >> 8) ^
>> +				((cons_index & 0xff0000) >> 16);
>> +
>> +	for (i = 0; i < sizeof(struct mlx5_cqe64); i += 8) {
>> +		temp_xor ^= *eight_byte_raw_cqe;
>> +		eight_byte_raw_cqe++;
>> +	}
>> +
>> +	for (i = 0; i < (sizeof(u64)); i++) {
>> +		cqe_bytewise_xor ^= *temp_bytewise_xor;
>> +		temp_bytewise_xor++;
>> +	}
>> +
>> +	if (cqe_bytewise_xor == 0xff)
>> +		return;
>> +
>> +	dev_err(&dev->mdev->pdev->dev,
>> +		"Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
>> +		cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
>> +	dev_err(&dev->mdev->pdev->dev,
>> +		"cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
>> +		cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn, cq->mcq.cqe_sz);
> mlx5_err ... and not dev_err ...
Will update dev_err to mlx5_err.
>
>> +
>> +	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
>> +		       16, 1, cqe64, sizeof(*cqe64), false);
>> +	BUG();
> No BUG() in new code.

I will remove the BUG() calls and only have it print the error messages.
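
For what it's worth, the tail of verify_cqe() would then look something
like this (a sketch of the planned change, assuming the driver's
mlx5_ib_err() logging helper; log the corruption, don't crash):

	if (cqe_bytewise_xor == 0xff)
		return;

	mlx5_ib_err(dev,
		    "Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
		    cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
	mlx5_ib_err(dev, "cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
		    cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn,
		    cq->mcq.cqe_sz);
	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
		       16, 1, cqe64, sizeof(*cqe64), false);
	/* No BUG(): the corrupt entry is logged and processing continues. */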

Thanks.


Best,

Rohit
Leon Romanovsky Oct. 27, 2022, 12:23 p.m. UTC | #3
On Tue, Oct 25, 2022 at 10:44:12AM -0700, Rohit Nair wrote:
> Hey Leon,
> 
> Please find my replies to your comments below:

<...>

> > 
> > > This patch does not introduce any significant performance degradations
> > > and has been tested using qperf.
> > What does it mean? You made changes in the kernel verbs flow; they are not
> > executed through qperf.
> We also conducted several extensive performance tests using our
> test suite, which utilizes rds-stress, and saw no significant
> performance degradation in those results.

What does "also" mean? Your change is applicable ONLY to the kernel path.

Anyway, I'm not keen on adding rare debug code to the performance-critical path.

Thanks
Rohit Nair Oct. 28, 2022, 11:48 p.m. UTC | #4
On 10/27/22 5:23 AM, Leon Romanovsky wrote:
> On Tue, Oct 25, 2022 at 10:44:12AM -0700, Rohit Nair wrote:
>> Hey Leon,
>>
>> Please find my replies to your comments below:
> 
> <...>
> 
>>>
>>>> This patch does not introduce any significant performance degradations
>>>> and has been tested using qperf.
>>> What does it mean? You made changes in the kernel verbs flow; they are not
>>> executed through qperf.
>> We also conducted several extensive performance tests using our
>> test suite, which utilizes rds-stress, and saw no significant
>> performance degradation in those results.
> 
> What does "also" mean? Your change is applicable ONLY to the kernel path.
> 
> Anyway, I'm not keen on adding rare debug code to the performance-critical path.
> 
> Thanks

rds-stress exercises the codepath we are modifying here, and it did not
show much performance degradation when we ran it internally. We also
asked our DB team to run performance regression testing, and this change
passed their test suite. This motivated us to submit it upstream.

If there is any other test that is better suited for this change, I am
willing to run it. Please let me know if you have something in mind; we
can revisit this patch after such a test, maybe.

I agree that this was a rare debug scenario, but it took a lot more time
than needed to narrow down [we engaged the vendor in live sessions]. We
are adding this in the hope of finding the cause earlier, or at least
pointing us in the right direction. We also requested that the vendor
[mlx] include some diagnostics [HW counters], which could help us narrow
it down faster next time. This is our attempt to add the kernel side of
those diagnostics.

Feel free to share your suggestions.

Thanks
Leon Romanovsky Nov. 6, 2022, 6:03 p.m. UTC | #5
On Fri, Oct 28, 2022 at 04:48:53PM -0700, Rohit Nair wrote:
> On 10/27/22 5:23 AM, Leon Romanovsky wrote:
> > On Tue, Oct 25, 2022 at 10:44:12AM -0700, Rohit Nair wrote:
> > > Hey Leon,
> > > 
> > > Please find my replies to your comments below:
> > 
> > <...>
> > 
> > > > 
> > > > > This patch does not introduce any significant performance degradations
> > > > > and has been tested using qperf.
> > > > What does it mean? You made changes in the kernel verbs flow; they are not
> > > > executed through qperf.
> > > We also conducted several extensive performance tests using our
> > > test suite, which utilizes rds-stress, and saw no significant
> > > performance degradation in those results.
> > 
> > What does "also" mean? Your change is applicable ONLY to the kernel path.
> > 
> > Anyway, I'm not keen on adding rare debug code to the performance-critical path.
> > 
> > Thanks
> 
> rds-stress exercises the codepath we are modifying here, and it did not
> show much performance degradation when we ran it internally. We also
> asked our DB team to run performance regression testing, and this change
> passed their test suite. This motivated us to submit it upstream.
>
> If there is any other test that is better suited for this change, I am
> willing to run it. Please let me know if you have something in mind; we
> can revisit this patch after such a test, maybe.
>
> I agree that this was a rare debug scenario, but it took a lot more time
> than needed to narrow down [we engaged the vendor in live sessions]. We
> are adding this in the hope of finding the cause earlier, or at least
> pointing us in the right direction. We also requested that the vendor
> [mlx] include some diagnostics [HW counters], which could help us narrow
> it down faster next time. This is our attempt to add the kernel side of
> those diagnostics.

The thing is that the "vendor" failed to explain internally whether this
debug code is useful. Like I said, extremely rare debug code shouldn't
be part of the main data path.

Thanks

> 
> Feel free to share your suggestions.
> 
> Thanks
>
Rohit Nair Nov. 7, 2022, 5:51 p.m. UTC | #6
On 11/6/22 10:03 AM, Leon Romanovsky wrote:
>>
>> rds-stress exercises the codepath we are modifying here, and it did not
>> show much performance degradation when we ran it internally. We also
>> asked our DB team to run performance regression testing, and this change
>> passed their test suite. This motivated us to submit it upstream.
>>
>> If there is any other test that is better suited for this change, I am
>> willing to run it. Please let me know if you have something in mind; we
>> can revisit this patch after such a test, maybe.
>>
>> I agree that this was a rare debug scenario, but it took a lot more
>> time than needed to narrow down [we engaged the vendor in live
>> sessions]. We are adding this in the hope of finding the cause earlier,
>> or at least pointing us in the right direction. We also requested that
>> the vendor [mlx] include some diagnostics [HW counters], which could
>> help us narrow it down faster next time. This is our attempt to add the
>> kernel side of those diagnostics.
> 
> The thing is that the "vendor" failed to explain internally whether this
> debug code is useful. Like I said, extremely rare debug code shouldn't
> be part of the main data path.
> 
> Thanks
>

I understand.
Thank you for taking the time to review this patch.


Best,
Rohit.
Leon Romanovsky Nov. 9, 2022, 6:33 p.m. UTC | #7
On Tue, Nov 08, 2022 at 04:24:48PM -0600, rama nichanamatlu wrote:
> in-line.

<...>

> > The thing is that the "vendor" failed to explain internally whether this
> > debug code is useful. Like I said, extremely rare debug code shouldn't
> > be part of the main data path.
> > 
> > Thanks
> 
> Thank you very much for your insights into the relevance of the patch.
> Before we close this topic, I do want to ask one thing: what is the
> expectation of the NIC hardware when it signatures/checksums an EQE or
> CQE?
>
> If it is not to be verified by the receiving host, for whatever reason,
> then why even do the checksum computation in hardware?

The mlx5 data sheet has more than 3000 pages in it. Not everything there is needed now.

Thanks

Patch

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index be189e0..2a6d722 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -441,6 +441,44 @@  static void mlx5_ib_poll_sw_comp(struct mlx5_ib_cq *cq, int num_entries,
 	}
 }
 
+static void verify_cqe(struct mlx5_cqe64 *cqe64, struct mlx5_ib_cq *cq)
+{
+	int i = 0;
+	u64 temp_xor = 0;
+	struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
+
+	u32 cons_index = cq->mcq.cons_index;
+	u64 *eight_byte_raw_cqe = (u64 *)cqe64;
+	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
+	u8 cqe_bytewise_xor = (cons_index & 0xff) ^
+				((cons_index & 0xff00) >> 8) ^
+				((cons_index & 0xff0000) >> 16);
+
+	for (i = 0; i < sizeof(struct mlx5_cqe64); i += 8) {
+		temp_xor ^= *eight_byte_raw_cqe;
+		eight_byte_raw_cqe++;
+	}
+
+	for (i = 0; i < (sizeof(u64)); i++) {
+		cqe_bytewise_xor ^= *temp_bytewise_xor;
+		temp_bytewise_xor++;
+	}
+
+	if (cqe_bytewise_xor == 0xff)
+		return;
+
+	dev_err(&dev->mdev->pdev->dev,
+		"Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
+		cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
+	dev_err(&dev->mdev->pdev->dev,
+		"cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
+		cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn, cq->mcq.cqe_sz);
+
+	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
+		       16, 1, cqe64, sizeof(*cqe64), false);
+	BUG();
+}
+
 static int mlx5_poll_one(struct mlx5_ib_cq *cq,
 			 struct mlx5_ib_qp **cur_qp,
 			 struct ib_wc *wc)
@@ -463,6 +501,8 @@  static int mlx5_poll_one(struct mlx5_ib_cq *cq,
 
 	cqe64 = (cq->mcq.cqe_sz == 64) ? cqe : cqe + 64;
 
+	verify_cqe(cqe64, cq);
+
 	++cq->mcq.cons_index;
 
 	/* Make sure we read CQ entry contents after we've checked the
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c..f2a6d8b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -102,6 +102,43 @@  static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
 	return cq;
 }
 
+static void verify_eqe(struct mlx5_eq *eq, struct mlx5_eqe *eqe)
+{
+	u64 *eight_byte_raw_eqe = (u64 *)eqe;
+	u8 eqe_bytewise_xor = (eq->cons_index & 0xff) ^
+			      ((eq->cons_index & 0xff00) >> 8) ^
+			      ((eq->cons_index & 0xff0000) >> 16);
+
+	int i = 0;
+	u64 temp_xor = 0;
+	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
+
+	for (i = 0; i < sizeof(struct mlx5_eqe); i += 8) {
+		temp_xor ^= *eight_byte_raw_eqe;
+		eight_byte_raw_eqe++;
+	}
+
+	for (i = 0; i < (sizeof(u64)); i++) {
+		eqe_bytewise_xor ^= *temp_bytewise_xor;
+		temp_bytewise_xor++;
+	}
+
+	if (eqe_bytewise_xor == 0xff)
+		return;
+
+	dev_err(&eq->dev->pdev->dev,
+		"Faulty EQE - checksum failure: ci=0x%x eqe_type=0x%x eqe_bytewise_xor=0x%x",
+		eq->cons_index, eqe->type, eqe_bytewise_xor);
+
+	dev_err(&eq->dev->pdev->dev,
+		"EQ addr=%p eqn=%u irqn=%u vec_index=%u",
+		eq, eq->eqn, eq->irqn, eq->vecidx);
+
+	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
+		       16, 1, eqe, sizeof(*eqe), false);
+	BUG();
+}
+
 static int mlx5_eq_comp_int(struct notifier_block *nb,
 			    __always_unused unsigned long action,
 			    __always_unused void *data)
@@ -127,6 +164,8 @@  static int mlx5_eq_comp_int(struct notifier_block *nb,
 		/* Assume (eqe->type) is always MLX5_EVENT_TYPE_COMP */
 		cqn = be32_to_cpu(eqe->data.comp.cqn) & 0xffffff;
 
+		verify_eqe(eq, eqe);
+
 		cq = mlx5_eq_cq_get(eq, cqn);
 		if (likely(cq)) {
 			++cq->arm_sn;