From patchwork Thu Jul 9 13:35:21 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jack Wang
X-Patchwork-Id: 6756311
In-Reply-To: <559E592D.5000201@mellanox.com>
References: <559D1562.2070309@mellanox.com> <559D2A80.4040909@mellanox.com> <559E592D.5000201@mellanox.com>
Date: Thu, 9 Jul 2015 15:35:21 +0200
Subject: Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
From: Jack Wang
To: Or Gerlitz
Cc: "linux-rdma@vger.kernel.org", Jack Morgenstein, Moni Shoua
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-rdma@vger.kernel.org

2015-07-09 13:21 GMT+02:00 Or Gerlitz:
> On 7/9/2015 2:14 PM, Jack Wang wrote:
>>
>> I managed to update the kernel to OFED 3.0 to verify the bug, but I
>> can still reproduce it. Maybe some synchronize_irq calls are still
>> missing?
>
> Again, even if you don't use the upstream kernel for production, I suggest
> you try to reproduce the bug there, and if it exists we'll try to solve it
> upstream and later port the fix to MLNX OFED. Makes sense? You can start
> with just the installed 3.18.14.
>
> Or.

Hello Or,

We have other kernel modules as well as the autotest infrastructure to
deal with, so it's not that easy to install a 3.18.14 kernel.

I looked into the code a little. I think the bug may be related to the
radix_tree usage in mlx4_cq_free(): the OFED code calls
radix_tree_delete() before synchronize_irq(), but the mainline code calls
radix_tree_delete() after synchronize_irq(). Does this matter?
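
To make the question concrete, here is a rough userspace sketch of the two
orderings. It is not driver code: everything named toy_* is hypothetical,
the table is a single slot rather than a radix tree, and
toy_synchronize_irq() is a no-op that only marks the point where
synchronize_irq() would have returned. It only illustrates what an
interrupt arriving after that point could still look up in each order; it
does not say which order is correct for the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a CQ; only the number is modelled. */
struct toy_cq {
	int cqn;
};

/* Hypothetical one-slot "table" standing in for cq_table->tree. */
static struct toy_cq *table_slot;

enum toy_order {
	DELETE_THEN_SYNC,	/* order described for the OFED code */
	SYNC_THEN_DELETE	/* order described for the mainline code */
};

/* Models the lookup an interrupt handler would do on the table. */
static bool toy_irq_lookup(void)
{
	return table_slot != NULL;
}

/* Models radix_tree_delete() on the single slot. */
static void toy_delete(void)
{
	table_slot = NULL;
}

/*
 * Marker only: the real synchronize_irq() waits for handlers that are
 * already running. This single-threaded toy has none, so it does nothing.
 */
static void toy_synchronize_irq(void)
{
}

/*
 * Returns true if an interrupt that fires right after
 * toy_synchronize_irq() returns could still find the CQ in the table.
 */
static bool lookup_possible_after_sync(enum toy_order order)
{
	struct toy_cq cq = { .cqn = 1 };
	bool found;

	table_slot = &cq;

	if (order == DELETE_THEN_SYNC) {
		toy_delete();
		toy_synchronize_irq();
		found = toy_irq_lookup();	/* late interrupt */
	} else {
		toy_synchronize_irq();
		found = toy_irq_lookup();	/* late interrupt, entry still present */
		toy_delete();
	}

	assert(table_slot == NULL);	/* both orders end with the entry gone */
	return found;
}
```

In this toy, lookup_possible_after_sync(SYNC_THEN_DELETE) is true and
lookup_possible_after_sync(DELETE_THEN_SYNC) is false, so the two orders
do differ in what a late interrupt can see.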

I'm building a new kernel with this small change:

Thanks,
Jack

---
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -393,16 +393,16 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
 	if (err)
 		mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn);
 
-	spin_lock(&cq_table->lock);
-	radix_tree_delete(&cq_table->tree, cq->cqn);
-	spin_unlock(&cq_table->lock);
-
 	synchronize_irq(priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq);
 	/* synchronize ASYNC irq */
 	if (priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq !=
 	    priv->eq_table.eq[MLX4_EQ_ASYNC].irq)
 		synchronize_irq(priv->eq_table.eq[MLX4_EQ_ASYNC].irq);
 
+	spin_lock(&cq_table->lock);
+	radix_tree_delete(&cq_table->tree, cq->cqn);
+	spin_unlock(&cq_table->lock);
+
 	if (atomic_dec_and_test(&cq->refcount))
 		complete(&cq->free);
 	wait_for_completion(&cq->free);