Message ID | 1346323085-2665-1-git-send-email-ogerlitz@mellanox.com (mailing list archive) |
---|---|
State | Rejected |
Roland Dreier <roland@kernel.org> wrote:
> Can you be explicit about the race you're worried about?
A few:

1. While CQ A is being deleted, an interrupt that relates to CQ B
takes place, so a radix tree lookup runs while an element is being
deleted from the tree; looking at the radix tree API, I don't see
that this is allowed.

2. While a CQ is being freed, an interrupt takes place and the driver
attempts to run the comp handler, which can turn into a use-after-free,
NULL pointer deref, etc. This can happen even if the ULP made sure to
consume all the WCs related to flushed work requests etc., e.g. an
"empty" interrupt (see the sketch below).
BTW, your email missed the list somehow
Or.
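For reference, the interrupt-path code that both scenarios above refer to is the unpatched mlx4_cq_completion(). The sketch below is reconstructed from the context lines of the patch at the end of this thread; exact whitespace and surrounding lines are approximate.

    void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
    {
            struct mlx4_cq *cq;

            /* Scenario 1: this lookup runs from hard-IRQ context with no
             * cq_table lock taken, while another CQ may be being deleted
             * from the same radix tree.
             */
            cq = radix_tree_lookup(&mlx4_priv(dev)->cq_table.tree,
                                   cqn & (dev->caps.num_cqs - 1));
            if (!cq) {
                    mlx4_dbg(dev, "Completion event for bogus CQ %08x\n", cqn);
                    return;
            }

            ++cq->arm_sn;

            /* Scenario 2: if this CQ is concurrently being freed, calling
             * the comp handler here can become a use-after-free or a NULL
             * pointer dereference.
             */
            cq->comp(cq);
    }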
On Thu, Aug 30, 2012 at 3:17 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> Roland Dreier <roland@kernel.org> wrote:
>> Can you be explicit about the race you're worried about?
>
> A few:
>
> 1. While CQ A is being deleted, an interrupt that relates to CQ B
> takes place, so a radix tree lookup runs while an element is being
> deleted from the tree; looking at the radix tree API, I don't see
> that this is allowed.

I don't think this is a real problem; the radix tree code is
explicitly designed for RCU use, and the data structure is pretty
clearly safe for looking up one slot while another slot is being
cleared. In fact it's hard to see how this could screw up.

> 2. While a CQ is being freed, an interrupt takes place and the driver
> attempts to run the comp handler, which can turn into a use-after-free,
> NULL pointer deref, etc. This can happen even if the ULP made sure to
> consume all the WCs related to flushed work requests etc., e.g. an
> "empty" interrupt.

So in mlx4_cq_free() we do

    mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
    //...
    synchronize_irq(priv->eq_table.eq[cq->vector].irq);

before we touch the cq table. I don't think we should get a CQ
completion event for the CQ we're freeing after we've done HW2SW_CQ on
it and then waited for any outstanding completion interrupts to finish.

Also we know that there are no QPs attached to this CQ so there
shouldn't be any completion events anyway...

 - R.
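For context, the ordering Roland describes lives in the driver's mlx4_cq_free(). The sketch below shows only that ordering; apart from the two calls quoted in his mail, the body is paraphrased from the general shape of the driver rather than copied from the source.

    void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
    {
            struct mlx4_priv *priv = mlx4_priv(dev);

            /* First move the CQ back to software ownership, so the HW
             * stops generating events for this CQN ...
             */
            mlx4_HW2SW_CQ(dev, NULL, cq->cqn);

            /* ... then wait for any completion interrupt already running
             * on this CQ's vector to finish ...
             */
            synchronize_irq(priv->eq_table.eq[cq->vector].irq);

            /* ... and only then remove the CQ from the radix tree (under
             * the cq_table lock) and release its resources.
             */
    }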
On Fri, Aug 31, 2012 at 1:35 AM, Roland Dreier <roland@kernel.org> wrote:
> On Thu, Aug 30, 2012 at 3:17 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>> 1. While CQ A is being deleted, an interrupt that relates to CQ B
>> takes place, so a radix tree lookup runs while an element is being
>> deleted from the tree; looking at the radix tree API, I don't see
>> that this is allowed.
>
> I don't think this is a real problem; the radix tree code is
> explicitly designed for RCU use, and the data structure is pretty
> clearly safe for looking up one slot while another slot is being
> cleared. In fact it's hard to see how this could screw up.

OK, I'll give it a second look, thanks for elaborating on the design.
One thing that probably confused us is that the driver's CQ event
callback does take the lock before searching the radix tree, and does
increase the refcount before invoking the user event handler. Is there
any reason for the event callback to use a different practice than the
comp one?

>> 2. While a CQ is being freed, an interrupt takes place and the driver
>> attempts to run the comp handler, which can turn into a use-after-free,
>> NULL pointer deref, etc. This can happen even if the ULP made sure to
>> consume all the WCs related to flushed work requests etc., e.g. an
>> "empty" interrupt.
>
> So in mlx4_cq_free() we do
>
>     mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
>     //...
>     synchronize_irq(priv->eq_table.eq[cq->vector].irq);
>
> before we touch the cq table. I don't think we should get a CQ
> completion event for the CQ we're freeing after we've done HW2SW_CQ on
> it and then waited for any outstanding completion interrupts to finish.

Yes, that makes sense. It wouldn't cover cases where the user does a
context switch, e.g. from hard-IRQ to softirq/tasklet, and lets that
handler touch the CQ, but the refcount in the driver wouldn't help in
that case either.

> Also we know that there are no QPs attached to this CQ so there
> shouldn't be any completion events anyway...

All in all, your reasoning makes much sense; we probably need to look
deeper into the crash report that triggered this RFC, see the next
email.

Yishai, were you thinking of other possible races that the patch could
address? Also, can you double-check the point Roland made on HW2SW_CQ?

Or.
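The event-callback practice Or contrasts with above is exactly the pattern the proposed patch copies into the completion path: look up under the cq_table lock and hold a reference across the callback. The sketch below paraphrases that pattern; the names follow mlx4_cq_event(), but the body is not quoted verbatim from the driver source.

    void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)
    {
            struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table;
            struct mlx4_cq *cq;

            /* The lookup is done under the table lock, and a reference is
             * taken before the lock is dropped.
             */
            spin_lock(&cq_table->lock);
            cq = radix_tree_lookup(&cq_table->tree,
                                   cqn & (dev->caps.num_cqs - 1));
            if (cq)
                    atomic_inc(&cq->refcount);
            spin_unlock(&cq_table->lock);

            if (!cq) {
                    mlx4_warn(dev, "Async event for bogus CQ %08x\n", cqn);
                    return;
            }

            /* The user's event handler runs with the reference held ... */
            cq->event(cq, event_type);

            /* ... and the CQ free path waits on cq->free until the last
             * reference is dropped.
             */
            if (atomic_dec_and_test(&cq->refcount))
                    complete(&cq->free);
    }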
> Roland Dreier <roland@kernel.org> wrote:
>> Can you be explicit about the race you're worried about?

Roland,

This patch was made after we got the below report from the field.
Dotan, can we get Roland access to the full vmcore image? Here's the
report I got:

"[...] box panic'ed when trying to find a completion queue. There is no
corruption, but there is a possible race which could result in
mlx4_cq_completion getting the wrong height of the radix tree and
following a bit too deep into the chains. In the other code which uses
this radix tree the access is protected by the lock, but
mlx4_cq_completion runs in interrupt context and cannot take locks, so
instead it runs without any protection whatsoever."

Below is the stack trace. The kernel is an RHEL5 2.6.18-274.something,
so in that respect, maybe your comment on the radix tree RCU design
doesn't apply there? You can see the problem was somewhere in the
radix tree code.

crash> bt
PID: 8178  TASK: ffff810b91a52400  CPU: 11  COMMAND: "universal_consu"
 #0 [ffff81182fe3bc70] crash_kexec at ffffffff800ba60c
 #1 [ffff81182fe3bd30] __die at ffffffff80069047
 #2 [ffff81182fe3bd70] die at ffffffff8007075b
 #3 [ffff81182fe3bda0] do_general_protection at ffffffff80069375
 #4 [ffff81182fe3bde0] error_exit at ffffffff80060e9d
    [exception RIP: radix_tree_lookup+38]
    RIP: ffffffff8016405f  RSP: ffff81182fe3be90  RFLAGS: 00210002
    RAX: 6b6b6b6b6b6b6b8b  RBX: ffff810c2fb90000  RCX: 0000000000000006
    RDX: 0000000000000001  RSI: 00000000000000c0  RDI: 6b6b6b6b6b6b6b6b
    RBP: 00000000000000c0  R8:  0000000000010000  R9:  000000000920ea94
    R10: 00000000000000b3  R11: ffffffff800c7ea5  R12: 00000000000000b3
    R13: ffff810c2b15a280  R14: ffff810b7a98ff58  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #5 [ffff81182fe3be90] mlx4_cq_completion at ffffffff886f2367
 #6 [ffff81182fe3beb0] mlx4_eq_int at ffffffff886f2bd3
 #7 [ffff81182fe3bf10] mlx4_msi_x_interrupt at ffffffff886f3078
 #8 [ffff81182fe3bf20] handle_IRQ_event at ffffffff8001167c
 #9 [ffff81182fe3bf50] __do_IRQ at ffffffff800c7f40
#10 [ffff81182fe3bf90] do_IRQ at ffffffff80071659
--- <IRQ stack> ---
#11 [ffff810b7a98ff58] ret_from_intr at ffffffff80060652
    RIP: 00000000f5cd8859  RSP: 00000000f3eafa28  RFLAGS: 00200202
    RAX: 0000000000000000  RBX: 00000000f3eafaf8  RCX: 000000000b6d4f00
    RDX: 000000000000001c  RSI: 000000000000001c  RDI: 0000000000000000
    RBP: ffff810bf1a0a120  R8:  0000000000000000  R9:  0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000001
    R13: 0000000000000000  R14: ffff810b9e7d5bf0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff4c  CS: 0023  SS: 002b
crash>
On Fri, Aug 31, 2012 at 1:35 AM, Roland Dreier <roland@kernel.org> wrote:
> I don't think this is a real problem; the radix tree code is [...]
So maybe this patch won't land in 3.7, we'll see. However, so far no
other patch sits in the for-next branch of your tree for the next merge
window... any plan to start reviewing / picking patches anytime soon?
Also, there are three more fixes for 3.6 that I sent you earlier this
week which need to get in, as they address real bugs (a deadlock in
IPoIB, crashes in mlx4 when attempting to register > 512GB, and more).
Or.
diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 7e64033..05d9529 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -56,9 +56,14 @@ void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
 {
 	struct mlx4_cq *cq;
+	struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table;
 
+	spin_lock(&cq_table->lock);
 	cq = radix_tree_lookup(&mlx4_priv(dev)->cq_table.tree,
 			       cqn & (dev->caps.num_cqs - 1));
+	if (cq)
+		atomic_inc(&cq->refcount);
+	spin_unlock(&cq_table->lock);
 	if (!cq) {
 		mlx4_dbg(dev, "Completion event for bogus CQ %08x\n", cqn);
 		return;
@@ -67,6 +72,8 @@ void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
 	++cq->arm_sn;
 
 	cq->comp(cq);
+	if (atomic_dec_and_test(&cq->refcount))
+		complete(&cq->free);
 }
 
 void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)