Message ID: 20221005174521.63619-1-rohit.sajan.kumar@oracle.com (mailing list archive)
State: Changes Requested
Series: [1/1] IB/mlx5: Add a signature check to received EQEs and CQEs
There is no need to ping anyone; the patch is registered in patchwork
https://patchwork.kernel.org/project/linux-rdma/patch/20221005174521.63619-1-rohit.sajan.kumar@oracle.com/
and we will get to it.

You sent the patch during the merge window, so it is no wonder that no one
looked at it.

On Wed, Oct 05, 2022 at 10:45:20AM -0700, Rohit Nair wrote:
> As PRM defines, the bytewise XOR of the EQE and the EQE index should be
> 0xff. Otherwise, we can assume we have a corrupt EQE. The same is
> applicable to CQE as well.

I didn't find anything like this in my version of the PRM.

>
> Adding a check to verify the EQE and CQE is valid in that aspect and if
> not, dump the CQE and EQE to dmesg to be inspected.

While it is nice to see prints in dmesg, you need to explain why other
mechanisms (reporters, mlx5 events, etc.) are not enough.

>
> This patch does not introduce any significant performance degradations
> and has been tested using qperf.

What does that mean? You made changes in the kernel verbs flow; they are
not executed through qperf.

>
> Suggested-by: Michael Guralnik <michaelgur@nvidia.com>
> Signed-off-by: Rohit Nair <rohit.sajan.kumar@oracle.com>
> ---
>  drivers/infiniband/hw/mlx5/cq.c              | 40 ++++++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 39 +++++++++++++++++++++++++++
>  2 files changed, 79 insertions(+)
>
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index be189e0..2a6d722 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -441,6 +441,44 @@ static void mlx5_ib_poll_sw_comp(struct mlx5_ib_cq *cq, int num_entries,
>  	}
>  }
>
> +static void verify_cqe(struct mlx5_cqe64 *cqe64, struct mlx5_ib_cq *cq)
> +{
> +	int i = 0;
> +	u64 temp_xor = 0;
> +	struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
> +
> +	u32 cons_index = cq->mcq.cons_index;
> +	u64 *eight_byte_raw_cqe = (u64 *)cqe64;
> +	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
> +	u8 cqe_bytewise_xor = (cons_index & 0xff) ^
> +			      ((cons_index & 0xff00) >> 8) ^
> +			      ((cons_index & 0xff0000) >> 16);
> +
> +	for (i = 0; i < sizeof(struct mlx5_cqe64); i += 8) {
> +		temp_xor ^= *eight_byte_raw_cqe;
> +		eight_byte_raw_cqe++;
> +	}
> +
> +	for (i = 0; i < (sizeof(u64)); i++) {
> +		cqe_bytewise_xor ^= *temp_bytewise_xor;
> +		temp_bytewise_xor++;
> +	}
> +
> +	if (cqe_bytewise_xor == 0xff)
> +		return;
> +
> +	dev_err(&dev->mdev->pdev->dev,
> +		"Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
> +		cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
> +	dev_err(&dev->mdev->pdev->dev,
> +		"cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
> +		cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn, cq->mcq.cqe_sz);

mlx5_err ... and not dev_err ...

> +
> +	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
> +		       16, 1, cqe64, sizeof(*cqe64), false);
> +	BUG();

No BUG() in new code.

Thanks
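For context, the helpers Leon points to here are the mlx5 drivers' own
logging wrappers rather than a raw dev_err() on the PCI device. A sketch
of what the switch might look like, assuming the existing mlx5_ib_err()
and mlx5_core_err() macros (illustrative only, not part of the posted
patch):

/* Illustrative: the IB driver's wrapper takes the mlx5_ib_dev ... */
mlx5_ib_err(dev, "Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
	    cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);

/* ... and the core driver's wrapper takes the mlx5_core_dev. */
mlx5_core_err(eq->dev, "Faulty EQE - checksum failure: ci=0x%x eqe_type=0x%x\n",
	      eq->cons_index, eqe->type);

Both wrappers identify the device in the print, which is what the review
comment is after.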
Hey Leon,

Please find my replies to your comments below:

On 10/11/22 12:17 AM, Leon Romanovsky wrote:
> There is no need to ping anyone; the patch is registered in patchwork
> https://patchwork.kernel.org/project/linux-rdma/patch/20221005174521.63619-1-rohit.sajan.kumar@oracle.com/
> and we will get to it.
>
> You sent the patch during the merge window, so it is no wonder that no
> one looked at it.
>
> On Wed, Oct 05, 2022 at 10:45:20AM -0700, Rohit Nair wrote:
>> As PRM defines, the bytewise XOR of the EQE and the EQE index should be
>> 0xff. Otherwise, we can assume we have a corrupt EQE. The same is
>> applicable to CQE as well.
> I didn't find anything like this in my version of the PRM.

The PRM does contain this information; it can be seen on page 121 of the
reference manual. This is the reference manual I consulted:
https://network.nvidia.com/files/doc-2020/ethernet-adapters-programming-manual.pdf

Here is a short extract from the reference manual:

"Byte-wise XOR of CQE - signature protection (see "Completion and Event
Queue Elements (CQEs and EQEs)" on page 156). CQE is valid if byte-wise
XOR of entire CQE (including signature field) and the CQE index is 0xff.
For 128B CQE, the GRH/inline_64 section is taken into account if data /
GRH was written to it (cqe_format == 2 or grh == 2)"

>
>> Adding a check to verify the EQE and CQE is valid in that aspect and if
>> not, dump the CQE and EQE to dmesg to be inspected.
> While it is nice to see prints in dmesg, you need to explain why other
> mechanisms (reporters, mlx5 events, etc.) are not enough.

We recently faced an issue where we were failing to arm the CQ due to an
sn_mismatch. That issue was not reported in the logs or error events, and
the same was true for the corrupted CQE. If we had been able to detect
the corrupted CQE or EQE earlier, we might have been able to respond to
the issue better and in a more timely manner. This is our motivation for
printing the error to dmesg.

>
>> This patch does not introduce any significant performance degradations
>> and has been tested using qperf.
> What does that mean? You made changes in the kernel verbs flow; they are
> not executed through qperf.

We also conducted several extensive performance tests using our
test suite, which utilizes rds-stress, and saw no significant performance
degradation in those results.
>> Suggested-by: Michael Guralnik <michaelgur@nvidia.com>
>> Signed-off-by: Rohit Nair <rohit.sajan.kumar@oracle.com>
>> ---
>>  drivers/infiniband/hw/mlx5/cq.c              | 40 ++++++++++++++++++++++++++++
>>  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 39 +++++++++++++++++++++++++++
>>  2 files changed, 79 insertions(+)
>>
>> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
>> index be189e0..2a6d722 100644
>> --- a/drivers/infiniband/hw/mlx5/cq.c
>> +++ b/drivers/infiniband/hw/mlx5/cq.c
>> @@ -441,6 +441,44 @@ static void mlx5_ib_poll_sw_comp(struct mlx5_ib_cq *cq, int num_entries,
>>  	}
>>  }
>>
>> +static void verify_cqe(struct mlx5_cqe64 *cqe64, struct mlx5_ib_cq *cq)
>> +{
>> +	int i = 0;
>> +	u64 temp_xor = 0;
>> +	struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
>> +
>> +	u32 cons_index = cq->mcq.cons_index;
>> +	u64 *eight_byte_raw_cqe = (u64 *)cqe64;
>> +	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
>> +	u8 cqe_bytewise_xor = (cons_index & 0xff) ^
>> +			      ((cons_index & 0xff00) >> 8) ^
>> +			      ((cons_index & 0xff0000) >> 16);
>> +
>> +	for (i = 0; i < sizeof(struct mlx5_cqe64); i += 8) {
>> +		temp_xor ^= *eight_byte_raw_cqe;
>> +		eight_byte_raw_cqe++;
>> +	}
>> +
>> +	for (i = 0; i < (sizeof(u64)); i++) {
>> +		cqe_bytewise_xor ^= *temp_bytewise_xor;
>> +		temp_bytewise_xor++;
>> +	}
>> +
>> +	if (cqe_bytewise_xor == 0xff)
>> +		return;
>> +
>> +	dev_err(&dev->mdev->pdev->dev,
>> +		"Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
>> +		cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
>> +	dev_err(&dev->mdev->pdev->dev,
>> +		"cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
>> +		cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn, cq->mcq.cqe_sz);
> mlx5_err ... and not dev_err ...

Will update dev_err to mlx5_err.

>
>> +
>> +	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
>> +		       16, 1, cqe64, sizeof(*cqe64), false);
>> +	BUG();
> No BUG() in new code.

I will remove the BUG() calls and only have it print the error messages.

Thanks.

Best,
Rohit
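To make the PRM rule quoted above concrete: a minimal standalone sketch of
the check it describes, in plain C outside the driver (the 64-byte entry
size and the 24-bit consumer-index fold are taken from the posted patch;
the function name is illustrative):

#include <stdint.h>
#include <stddef.h>

/* Illustrative only: returns non-zero if the entry passes the PRM
 * signature rule - the byte-wise XOR of the whole entry (including its
 * signature byte) and the entry's consumer index must equal 0xff.
 */
static int entry_signature_ok(const uint8_t *entry, size_t len, uint32_t ci)
{
	/* Fold the 24-bit consumer index into a single byte. */
	uint8_t x = (ci & 0xff) ^ ((ci >> 8) & 0xff) ^ ((ci >> 16) & 0xff);
	size_t i;

	/* XOR in every byte of the entry, signature field included. */
	for (i = 0; i < len; i++)
		x ^= entry[i];

	return x == 0xff;
}

The posted patch computes the same value, only eight bytes at a time
before folding the accumulator down to one byte.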
On Tue, Oct 25, 2022 at 10:44:12AM -0700, Rohit Nair wrote:
> Hey Leon,
>
> Please find my replies to your comments below:

<...>

> >
> > > This patch does not introduce any significant performance degradations
> > > and has been tested using qperf.
> > What does that mean? You made changes in the kernel verbs flow; they
> > are not executed through qperf.
> We also conducted several extensive performance tests using our
> test suite, which utilizes rds-stress, and saw no significant performance
> degradation in those results.

What does "also" mean? Your change is applicable ONLY to the kernel path.

Anyway, I'm not keen on adding rare debug code to a performance-critical
path.

Thanks
On 10/27/22 5:23 AM, Leon Romanovsky wrote:
> On Tue, Oct 25, 2022 at 10:44:12AM -0700, Rohit Nair wrote:
>> Hey Leon,
>>
>> Please find my replies to your comments below:
>
> <...>
>
>>>
>>>> This patch does not introduce any significant performance degradations
>>>> and has been tested using qperf.
>>> What does that mean? You made changes in the kernel verbs flow; they
>>> are not executed through qperf.
>> We also conducted several extensive performance tests using our
>> test suite, which utilizes rds-stress, and saw no significant
>> performance degradation in those results.
>
> What does "also" mean? Your change is applicable ONLY to the kernel path.
>
> Anyway, I'm not keen on adding rare debug code to a performance-critical
> path.
>
> Thanks

rds-stress exercises the code path we are modifying here, and it did not
show much performance degradation when we ran it internally. We also asked
our DB team to run performance regression testing, and this change passed
their test suite. This is what motivated us to submit it upstream.

If there is another test that is better suited to this change, I am
willing to run it. Please let me know if you have something in mind; we
can revisit this patch after such a test, perhaps.

I agree that this was a rare debug scenario, but it took far longer than
it should have to narrow down (we engaged the vendor in live sessions).
We are adding this in the hope of finding the cause earlier, or at least
getting a pointer to which direction to look in. We have also asked the
vendor (mlx) to include some diagnostics (a HW counter) that can help us
narrow this down faster next time. This patch is our attempt to add the
kernel side of those diagnostics.

Feel free to share your suggestions.

Thanks
On Fri, Oct 28, 2022 at 04:48:53PM -0700, Rohit Nair wrote:
> On 10/27/22 5:23 AM, Leon Romanovsky wrote:
> > On Tue, Oct 25, 2022 at 10:44:12AM -0700, Rohit Nair wrote:
> > > Hey Leon,
> > >
> > > Please find my replies to your comments below:
> >
> > <...>
> >
> > > >
> > > > > This patch does not introduce any significant performance degradations
> > > > > and has been tested using qperf.
> > > > What does that mean? You made changes in the kernel verbs flow;
> > > > they are not executed through qperf.
> > > We also conducted several extensive performance tests using our
> > > test suite, which utilizes rds-stress, and saw no significant
> > > performance degradation in those results.
> >
> > What does "also" mean? Your change is applicable ONLY to the kernel path.
> >
> > Anyway, I'm not keen on adding rare debug code to a performance-critical
> > path.
> >
> > Thanks
>
> rds-stress exercises the code path we are modifying here, and it did not
> show much performance degradation when we ran it internally. We also asked
> our DB team to run performance regression testing, and this change passed
> their test suite. This is what motivated us to submit it upstream.
>
> If there is another test that is better suited to this change, I am
> willing to run it. Please let me know if you have something in mind; we
> can revisit this patch after such a test, perhaps.
>
> I agree that this was a rare debug scenario, but it took far longer than
> it should have to narrow down (we engaged the vendor in live sessions).
> We are adding this in the hope of finding the cause earlier, or at least
> getting a pointer to which direction to look in. We have also asked the
> vendor (mlx) to include some diagnostics (a HW counter) that can help us
> narrow this down faster next time. This patch is our attempt to add the
> kernel side of those diagnostics.

The thing is that the "vendor" failed to explain internally whether this
debug code is useful. Like I said, extremely rare debug code shouldn't be
part of the main data path.

Thanks

>
> Feel free to share your suggestions.
>
> Thanks
>
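A common kernel pattern for reconciling these two positions - keeping a
rare validity check available without taxing the hot path - is a static
branch, which costs only a patched-in no-op while disabled. A sketch under
the assumption of a hypothetical enable knob (this was not proposed in the
thread; the key name and setter are illustrative):

#include <linux/jump_label.h>

/* Hypothetical: off by default, so the poll loop runs a no-op unless an
 * operator explicitly turns the verification on (e.g. from debugfs).
 */
static DEFINE_STATIC_KEY_FALSE(mlx5_verify_cqe_key);

static void mlx5_verify_cqe_set(bool on)
{
	if (on)
		static_branch_enable(&mlx5_verify_cqe_key);
	else
		static_branch_disable(&mlx5_verify_cqe_key);
}

/* In mlx5_poll_one(), the check vanishes from the fast path when off: */
	if (static_branch_unlikely(&mlx5_verify_cqe_key))
		verify_cqe(cqe64, cq);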
On 11/6/22 10:03 AM, Leon Romanovsky wrote:
>>
>> rds-stress exercises the code path we are modifying here, and it did not
>> show much performance degradation when we ran it internally. We also
>> asked our DB team to run performance regression testing, and this change
>> passed their test suite. This is what motivated us to submit it upstream.
>>
>> If there is another test that is better suited to this change, I am
>> willing to run it. Please let me know if you have something in mind; we
>> can revisit this patch after such a test, perhaps.
>>
>> I agree that this was a rare debug scenario, but it took far longer than
>> it should have to narrow down (we engaged the vendor in live sessions).
>> We are adding this in the hope of finding the cause earlier, or at least
>> getting a pointer to which direction to look in. We have also asked the
>> vendor (mlx) to include some diagnostics (a HW counter) that can help us
>> narrow this down faster next time. This patch is our attempt to add the
>> kernel side of those diagnostics.
>
> The thing is that the "vendor" failed to explain internally whether this
> debug code is useful. Like I said, extremely rare debug code shouldn't be
> part of the main data path.
>
> Thanks
>

I understand. Thank you for taking the time to review this patch.

Best,
Rohit.
On Tue, Nov 08, 2022 at 04:24:48PM -0600, rama nichanamatlu wrote:
> in-line.

<...>

> > The thing is that the "vendor" failed to explain internally whether this
> > debug code is useful. Like I said, extremely rare debug code shouldn't
> > be part of the main data path.
> >
> > Thanks
>
> thank you very much for your insights into the relevance of the patch.
> before we close this topic, I do want to ask one thing: what is the
> expectation of the NIC hardware when it signatures / checksums an EQE
> or CQE?
>
> if it is not to be verified by the receiving host, for whatever reason,
> then why even do the checksum computation in the hardware?

The mlx5 data sheet has more than 3000 pages in it. Not everything there
is needed now.

Thanks
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index be189e0..2a6d722 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -441,6 +441,44 @@ static void mlx5_ib_poll_sw_comp(struct mlx5_ib_cq *cq, int num_entries,
 	}
 }
 
+static void verify_cqe(struct mlx5_cqe64 *cqe64, struct mlx5_ib_cq *cq)
+{
+	int i = 0;
+	u64 temp_xor = 0;
+	struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
+
+	u32 cons_index = cq->mcq.cons_index;
+	u64 *eight_byte_raw_cqe = (u64 *)cqe64;
+	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
+	u8 cqe_bytewise_xor = (cons_index & 0xff) ^
+			      ((cons_index & 0xff00) >> 8) ^
+			      ((cons_index & 0xff0000) >> 16);
+
+	for (i = 0; i < sizeof(struct mlx5_cqe64); i += 8) {
+		temp_xor ^= *eight_byte_raw_cqe;
+		eight_byte_raw_cqe++;
+	}
+
+	for (i = 0; i < (sizeof(u64)); i++) {
+		cqe_bytewise_xor ^= *temp_bytewise_xor;
+		temp_bytewise_xor++;
+	}
+
+	if (cqe_bytewise_xor == 0xff)
+		return;
+
+	dev_err(&dev->mdev->pdev->dev,
+		"Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
+		cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
+	dev_err(&dev->mdev->pdev->dev,
+		"cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
+		cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn, cq->mcq.cqe_sz);
+
+	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
+		       16, 1, cqe64, sizeof(*cqe64), false);
+	BUG();
+}
+
 static int mlx5_poll_one(struct mlx5_ib_cq *cq,
 			 struct mlx5_ib_qp **cur_qp,
 			 struct ib_wc *wc)
@@ -463,6 +501,8 @@ static int mlx5_poll_one(struct mlx5_ib_cq *cq,
 
 	cqe64 = (cq->mcq.cqe_sz == 64) ? cqe : cqe + 64;
 
+	verify_cqe(cqe64, cq);
+
 	++cq->mcq.cons_index;
 
 	/* Make sure we read CQ entry contents after we've checked the
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c..f2a6d8b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -102,6 +102,43 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
 	return cq;
 }
 
+static void verify_eqe(struct mlx5_eq *eq, struct mlx5_eqe *eqe)
+{
+	u64 *eight_byte_raw_eqe = (u64 *)eqe;
+	u8 eqe_bytewise_xor = (eq->cons_index & 0xff) ^
+			      ((eq->cons_index & 0xff00) >> 8) ^
+			      ((eq->cons_index & 0xff0000) >> 16);
+
+	int i = 0;
+	u64 temp_xor = 0;
+	u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
+
+	for (i = 0; i < sizeof(struct mlx5_eqe); i += 8) {
+		temp_xor ^= *eight_byte_raw_eqe;
+		eight_byte_raw_eqe++;
+	}
+
+	for (i = 0; i < (sizeof(u64)); i++) {
+		eqe_bytewise_xor ^= *temp_bytewise_xor;
+		temp_bytewise_xor++;
+	}
+
+	if (eqe_bytewise_xor == 0xff)
+		return;
+
+	dev_err(&eq->dev->pdev->dev,
+		"Faulty EQE - checksum failure: ci=0x%x eqe_type=0x%x eqe_bytewise_xor=0x%x",
+		eq->cons_index, eqe->type, eqe_bytewise_xor);
+
+	dev_err(&eq->dev->pdev->dev,
+		"EQ addr=%p eqn=%u irqn=%u vec_index=%u",
+		eq, eq->eqn, eq->irqn, eq->vecidx);
+
+	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
+		       16, 1, eqe, sizeof(*eqe), false);
+	BUG();
+}
+
 static int mlx5_eq_comp_int(struct notifier_block *nb,
 			    __always_unused unsigned long action,
 			    __always_unused void *data)
@@ -127,6 +164,8 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
 		/* Assume (eqe->type) is always MLX5_EVENT_TYPE_COMP */
 		cqn = be32_to_cpu(eqe->data.comp.cqn) & 0xffffff;
 
+		verify_eqe(eq, eqe);
+
 		cq = mlx5_eq_cq_get(eq, cqn);
 		if (likely(cq)) {
 			++cq->arm_sn;
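A side note on the reduction used by verify_cqe() and verify_eqe() above:
XOR-ing the entry as u64 words and then folding the eight accumulator
bytes is equivalent to a plain byte-wise XOR, because each byte of the
entry lands in exactly one byte lane of the accumulator. A small
standalone userspace check of that equivalence (illustrative, not driver
code; the fill pattern is arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	uint8_t entry[64];
	uint64_t words[8], acc = 0;
	uint8_t bytewise = 0, folded = 0;
	int i;

	for (i = 0; i < 64; i++)
		entry[i] = (uint8_t)(i * 37 + 11);	/* arbitrary pattern */

	/* Plain byte-wise XOR over the whole entry. */
	for (i = 0; i < 64; i++)
		bytewise ^= entry[i];

	/* The patch's approach: XOR u64 words, then fold the accumulator. */
	memcpy(words, entry, sizeof(entry));
	for (i = 0; i < 8; i++)
		acc ^= words[i];
	for (i = 0; i < 8; i++)
		folded ^= (uint8_t)(acc >> (8 * i));

	printf("bytewise=0x%02x folded=0x%02x %s\n", bytewise, folded,
	       bytewise == folded ? "match" : "MISMATCH");
	return bytewise != folded;
}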
As PRM defines, the bytewise XOR of the EQE and the EQE index should be
0xff. Otherwise, we can assume we have a corrupt EQE. The same is
applicable to CQE as well.

Adding a check to verify the EQE and CQE is valid in that aspect and if
not, dump the CQE and EQE to dmesg to be inspected.

This patch does not introduce any significant performance degradations
and has been tested using qperf.

Suggested-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Rohit Nair <rohit.sajan.kumar@oracle.com>
---
 drivers/infiniband/hw/mlx5/cq.c              | 40 ++++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 39 +++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)