diff mbox

[RFC,for-next] net/mlx4_core: Fix racy flow in the driver CQ completion handler

Message ID 1346323085-2665-1-git-send-email-ogerlitz@mellanox.com (mailing list archive)
State Rejected
Headers show

Commit Message

Or Gerlitz Aug. 30, 2012, 10:38 a.m. UTC
From: Yishai Hadas <yishaih@mellanox.com>

The mlx4 CQ completion handler, mlx4_cq_completion doesn't bother to lock 
the radix tree which is used to manage the table of CQs, nor to increase
the reference count of the CQ before invoking the user provided callback.

This is racy and can cause use after free, null pointer dereference, etc various 
crashes. Fix that by wrapping the radix tree lookup with taking the table lock, 
and increase the ref count for the time of the callback.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>

---

Hi Roland,

Doing some code review on the related parts, I didn't see any way
the handler can be protected for the tree lockup and the user callback
invokation other then using these methods, e.g as done by the "event"
callback. Looking on mthca, I see the same behaviour -- mthca_cq_completion
takes no locks and doesn't increase the refcount while mthca_cq_event does
go by the book and do what ever needed, anything we miss here? its there
from day one... 

Or.

 drivers/net/ethernet/mellanox/mlx4/cq.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

Comments

Or Gerlitz Aug. 30, 2012, 10:17 p.m. UTC | #1
Roland Dreier ?<roland@kernel.org>  wrote:

> Can you be explicit about the race you're worried about?

few

1. on the  time CQ A is deleted an interrupt that relates to CQ B
takes place and a radix
tree lookup is running while an element is being deleted from the
tree, looking on the radix tree API, I don't see that this is allowed.

2. while a CQ is being freed an interrupt takes place and the driver
attempts to run the comp handler which can turn to use after free,
null pointer deref, etc. This can happen even if the ULP made sure to
consume all the WCs related to flushed/etc, e.g an "empty"
interrupt

BTW, your email missed the list somehow

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier Aug. 30, 2012, 10:35 p.m. UTC | #2
On Thu, Aug 30, 2012 at 3:17 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> Roland Dreier ?<roland@kernel.org>  wrote:
>
>> Can you be explicit about the race you're worried about?
>
> few
>
> 1. on the  time CQ A is deleted an interrupt that relates to CQ B
> takes place and a radix
> tree lookup is running while an element is being deleted from the
> tree, looking on the radix tree API, I don't see that this is allowed.

I don't think this is a real problem; the radix tree code is
explicitly designed for RCU use, and the data structure is pretty
clearly safe for looking up one slot while another slot is being
cleared.  In fact it's hard to see how this could screw up.

> 2. while a CQ is being freed an interrupt takes place and the driver
> attempts to run the comp handler which can turn to use after free,
> null pointer deref, etc. This can happen even if the ULP made sure to
> consume all the WCs related to flushed/etc, e.g an "empty"
> interrupt

So in mlx4_cq_free() we do

        mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
        //...
        synchronize_irq(priv->eq_table.eq[cq->vector].irq);

before we touch the cq table.  I don't think we should get a CQ
completion event for the CQ we're freeing after we've done HW2SW_CQ on
it and then waited for any outstanding completion interrupts to
finish.

Also we know that there are no QPs attached to this CQ so there
shouldn't be any completion events anyway...

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Aug. 31, 2012, 6:53 a.m. UTC | #3
On Fri, Aug 31, 2012 at 1:35 AM, Roland Dreier <roland@kernel.org> wrote:
> On Thu, Aug 30, 2012 at 3:17 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:

>> 1. on the  time CQ A is deleted an interrupt that relates to CQ B
>> takes place and a radix
>> tree lookup is running while an element is being deleted from the
>> tree, looking on the radix tree API, I don't see that this is allowed.
>
> I don't think this is a real problem; the radix tree code is
> explicitly designed for RCU use, and the data structure is pretty
> clearly safe for looking up one slot while another slot is being
> cleared.  In fact it's hard to see how this could screw up.

OK, I'll give it a 2nd look, thanks for elaborating on the design, one
thing which probably confused us was the driver cq event callback
which does take the lock before searching the radix tree, and increase
the refcount before invoking the user event handler, any reason for
the driver event callback to use different practice vs. the comp one?

>> 2. while a CQ is being freed an interrupt takes place and the driver
>> attempts to run the comp handler which can turn to use after free,
>> null pointer deref, etc. This can happen even if the ULP made sure to
>> consume all the WCs related to flushed/etc, e.g an "empty" interrupt

> So in mlx4_cq_free() we do
>         mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
>         //...
>         synchronize_irq(priv->eq_table.eq[cq->vector].irq);
>
> before we touch the cq table.  I don't think we should get a CQ
> completion event for the CQ we're freeing after we've done HW2SW_CQ on
> it and then waited for any outstanding completion interrupts to finish.

yes, makes sense, this wouldn't handle use cases where the user does
context switch, e.g from hard_irq to softirq/tasklet and let their
handler touch the CQ, but the refcount in the driver wouldn't help
either in that case.

> Also we know that there are no QPs attached to this CQ so there
> shouldn't be any completion events anyway...

All to all, your reasoning makes much sense, we probably need to look
deeper into the crash report that triggered this RFC, see next email.

Yishai - were you thinking on other possible races that the patch
could address? also, can you double check the point Roland made on
HW2SW_CQ.


Or.
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Aug. 31, 2012, 7:02 a.m. UTC | #4
> Roland Dreier ?<roland@kernel.org>  wrote:
>> Can you be explicit about the race you're worried about?

Roland,

This patch was made after we got the below report from the field.

Dotan, can we get Roland access to the full vmcore image?

Here's the report I got:

"[...] box panic'ed when trying to find a completion queue. There is
no corruption but there is a possible race which could result in
mlx4_cq_completion getting wrong height of the radix tree and
following a bit too deep into the chains. In the other code which uses
this radix tree the access is protected by the lock but
mlx4_cq_completion is running win the interrupt context and cannot
take locks, so instead it run without any protection whatsoever."

The below is the stack trace, the kernel is an RHEL5
2.6.18-274.something, so in that respect, maybe your comment on the
radix tree RCU doesn't take effect there? you can see the problem was
somewhere into the radix tree code.

crash> bt
PID: 8178   TASK: ffff810b91a52400  CPU: 11  COMMAND: "universal_consu"
#0 [ffff81182fe3bc70] crash_kexec at ffffffff800ba60c
#1 [ffff81182fe3bd30] __die at ffffffff80069047
#2 [ffff81182fe3bd70] die at ffffffff8007075b
#3 [ffff81182fe3bda0] do_general_protection at ffffffff80069375
#4 [ffff81182fe3bde0] error_exit at ffffffff80060e9d

    [exception RIP: radix_tree_lookup+38]
    RIP: ffffffff8016405f  RSP: ffff81182fe3be90  RFLAGS: 00210002
    RAX: 6b6b6b6b6b6b6b8b  RBX: ffff810c2fb90000  RCX: 0000000000000006
    RDX: 0000000000000001  RSI: 00000000000000c0  RDI: 6b6b6b6b6b6b6b6b
    RBP: 00000000000000c0   R8: 0000000000010000   R9: 000000000920ea94
    R10: 00000000000000b3  R11: ffffffff800c7ea5  R12: 00000000000000b3
    R13: ffff810c2b15a280  R14: ffff810b7a98ff58  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000

#5 [ffff81182fe3be90] mlx4_cq_completion at ffffffff886f2367
#6 [ffff81182fe3beb0] mlx4_eq_int at ffffffff886f2bd3
#7 [ffff81182fe3bf10] mlx4_msi_x_interrupt at ffffffff886f3078
#8 [ffff81182fe3bf20] handle_IRQ_event at ffffffff8001167c
#9 [ffff81182fe3bf50] __do_IRQ at ffffffff800c7f40
#10 [ffff81182fe3bf90] do_IRQ at ffffffff80071659

--- <IRQ stack> ---

#11 [ffff810b7a98ff58] ret_from_intr at ffffffff80060652
    RIP: 00000000f5cd8859  RSP: 00000000f3eafa28  RFLAGS: 00200202
    RAX: 0000000000000000  RBX: 00000000f3eafaf8  RCX: 000000000b6d4f00
    RDX: 000000000000001c  RSI: 000000000000001c  RDI: 0000000000000000
    RBP: ffff810bf1a0a120   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000001
    R13: 0000000000000000  R14: ffff810b9e7d5bf0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff4c  CS: 0023  SS: 002b
crash>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Aug. 31, 2012, 7:12 a.m. UTC | #5
On Fri, Aug 31, 2012 at 1:35 AM, Roland Dreier <roland@kernel.org> wrote:

> I don't think this is a real problem; the radix tree code is [...]

So maybe this patch wouldn't land in 3.7, we'll see, however, so far
no other patch sits in the for-next branch of your tree for the next
window... any plan to start reviewing / picking patches anytime soon?

Also, there are three more fixes for 3.6 I sent you earlier this week,
which need to get in as they address real bugs (deadlock in IPoIB,
crashes in mlx4 when attempting to register > 512GB, more).

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 7e64033..05d9529 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -56,9 +56,14 @@ 
 void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
 {
 	struct mlx4_cq *cq;
+	struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table;
 
+	spin_lock(&cq_table->lock);
 	cq = radix_tree_lookup(&mlx4_priv(dev)->cq_table.tree,
 			       cqn & (dev->caps.num_cqs - 1));
+	if (cq)
+		atomic_inc(&cq->refcount);
+	spin_unlock(&cq_table->lock);
 	if (!cq) {
 		mlx4_dbg(dev, "Completion event for bogus CQ %08x\n", cqn);
 		return;
@@ -67,6 +72,8 @@  void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
 	++cq->arm_sn;
 
 	cq->comp(cq);
+	if (atomic_dec_and_test(&cq->refcount))
+		complete(&cq->free);
 }
 
 void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)