From patchwork Fri Feb 1 17:36:15 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Bernd Schubert X-Patchwork-Id: 2081341 Return-Path: X-Original-To: patchwork-linux-rdma@patchwork.kernel.org Delivered-To: patchwork-process-083081@patchwork1.kernel.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by patchwork1.kernel.org (Postfix) with ESMTP id 08FB340E3D for ; Fri, 1 Feb 2013 17:36:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756967Ab3BARgW (ORCPT ); Fri, 1 Feb 2013 12:36:22 -0500 Received: from mailgw1.uni-kl.de ([131.246.120.220]:44000 "EHLO mailgw1.uni-kl.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756890Ab3BARgV (ORCPT ); Fri, 1 Feb 2013 12:36:21 -0500 Received: from itwm2.itwm.fhg.de (itwm2.itwm.fhg.de [131.246.191.3]) by mailgw1.uni-kl.de (8.14.3/8.14.3/Debian-9.4) with ESMTP id r11HaIHc018199 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NOT) for ; Fri, 1 Feb 2013 18:36:19 +0100 Received: from mail2.itwm.fhg.de ([131.246.191.79]:39406) by itwm2.itwm.fhg.de with esmtps (TLSv1:DES-CBC3-SHA:168) (/C=DE/ST=Rheinland-Pfalz/L=Kaiserslautern/O=Fraunhofer ITWM/OU=SLG/CN=mail2.itwm.fhg.de)(verified=1) (Exim 4.74 #1) id 1U1KX9-00064o-QE; Fri, 01 Feb 2013 18:36:15 +0100 Message-ID: <510BFD0F.1060803@itwm.fraunhofer.de> Date: Fri, 01 Feb 2013 18:36:15 +0100 From: Bernd Schubert User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: "Hefty, Sean" CC: "Nikolova, Tatyana E" , "linux-rdma (linux-rdma@vger.kernel.org)" Subject: Re: crash in librdmacm References: <510A954B.9060301@itwm.fraunhofer.de> <1828884A29C6694DAF28B7E6B8A8237357D423D5@ORSMSX101.amr.corp.intel.com> In-Reply-To: <1828884A29C6694DAF28B7E6B8A8237357D423D5@ORSMSX101.amr.corp.intel.com> X-ITWM-CharSet: UTF-8 X-ITWM-Scanned-By: mail2.itwm.fhg.de Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-rdma@vger.kernel.org On 01/31/2013 09:18 PM, Hefty, Sean wrote: >> we got a bug report from a customer that FhGFS server daemons >> frequently crash. Looking into it, we are almost sure it is related >> to librdmacm/kernel-ib. So the crash in librdmacm resolves to: >> >>> (gdb) l *(rdma_get_cm_event+0x102) 0x3d6dc04522 is in >>> rdma_get_cm_event (src/cma.c:1975). 1970 >>> break; 1971 default: 1972 >>> evt->id_priv = (void *) (uintptr_t) resp.uid; 1973 >>> evt->event.id = &evt->id_priv->id; 1974 >>> evt->event.status = resp.status; 1975 if >>> (ucma_is_ud_qp(evt->id_priv->id.qp_type)) 1976 >>> ucma_copy_ud_event(evt, &resp.param.ud); 1977 >>> else 1978 ucma_copy_conn_event(evt, >>> &resp.param.conn); 1979 break; >> >> >> Now the complete shows that the issue comes up directly after >> initiating a connection and we are just querying the file >> descriptor for events. So from our point of view there is not much >> we can do about it. >> >> Possibly this is already fixed by commit >> 418edaaba96e58112b15c82b4907084e2a9caf42 and this commit is also >> not yet in the latest RHEL kernel being used on the customer >> system. However, the commit messages states it is for >> RDMA_CM_EVENT_ESTABLISHED only, while the resolved address above >> points to another event. Does the commit really on fix this event >> or would it also fix further events? > > The above commit fixes an issue where the uid can be returned 0 from > the kernel. > > A call to resolve an address should not have this issue. The issue > is limited to connections formed from a listen request. The reason > for this is that passive side connections create the kernel id first, > which is then linked up with the user space id. Active side > processing creates the user space id, then the kernel id. Hmm, I think there is a misunderstanding. Actually I meant the server side, so indeed the connection and crash are formed from the listen request. Could we somehow also fix or at least catch the issue from librdmacm, e.g. by checking for uid=0, see the patch below? Probably the patch is not sufficient, as the event needs to be deleted somehow, is there a way to do that without further processing it? >> While looking where file->mut is being set, I notice that in >> ucma_create_id() the call of ucma_alloc_ctx() is protected by a >> mutex, but the mutex is immediately given up after that call. In >> ucma_alloc_ctx() the ctx is already added to file->ctx_list, but >> the ctx->uid and ctx->cm_id are then modified outside a mutex in >> ucma_create_id(). Shouldn't the mutex_unlock() be done after >> assigning those values? > > It shouldn't matter, as long as the uid and cm_id are set before any > events can be generated. Events are not generated on newly created > id's. Ah, thanks. Thanks, Bernd rdma_get_cm_event: Check for a 0 uid --- src/cma.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/src/cma.c b/src/cma.c index ff9b426..236cbd9 100755 --- a/src/cma.c +++ b/src/cma.c @@ -1929,6 +1929,8 @@ retry: break; case RDMA_CM_EVENT_CONNECT_REQUEST: evt->id_priv = (void *) (uintptr_t) resp.uid; + if (!evt->id_priv) + return ERR(EINVAL); /* kernel bug? */ if (ucma_is_ud_qp(evt->id_priv->id.qp_type)) ucma_copy_ud_event(evt, &resp.param.ud); else @@ -1941,6 +1943,8 @@ retry: case RDMA_CM_EVENT_CONNECT_RESPONSE: ucma_copy_conn_event(evt, &resp.param.conn); evt->event.status = ucma_process_conn_resp(evt->id_priv); + if (!evt->id_priv) + return ERR(EINVAL); /* kernel bug? */ if (!evt->event.status) evt->event.event = RDMA_CM_EVENT_ESTABLISHED; else { @@ -1949,6 +1953,8 @@ retry: } break; case RDMA_CM_EVENT_ESTABLISHED: + if (!evt->id_priv) + return ERR(EINVAL); /* kernel bug? */ if (ucma_is_ud_qp(evt->id_priv->id.qp_type)) { ucma_copy_ud_event(evt, &resp.param.ud); break; @@ -1957,6 +1963,8 @@ retry: ucma_copy_conn_event(evt, &resp.param.conn); break; case RDMA_CM_EVENT_REJECTED: + if (!evt->id_priv) + return ERR(EINVAL); /* kernel bug? */ if (evt->id_priv->connect_error) { ucma_complete_event(evt->id_priv); goto retry; @@ -1965,6 +1973,8 @@ retry: ucma_modify_qp_err(evt->event.id); break; case RDMA_CM_EVENT_DISCONNECTED: + if (!evt->id_priv) + return ERR(EINVAL); /* kernel bug? */ if (evt->id_priv->connect_error) { ucma_complete_event(evt->id_priv); goto retry; @@ -1974,6 +1984,8 @@ retry: case RDMA_CM_EVENT_MULTICAST_JOIN: evt->mc = (void *) (uintptr_t) resp.uid; evt->id_priv = evt->mc->id_priv; + if (!evt->id_priv) + return ERR(EINVAL); /* kernel bug? */ evt->event.id = &evt->id_priv->id; ucma_copy_ud_event(evt, &resp.param.ud); evt->event.param.ud.private_data = evt->mc->context; @@ -1989,6 +2001,8 @@ retry: break; default: evt->id_priv = (void *) (uintptr_t) resp.uid; + if (!evt->id_priv) + return ERR(EINVAL); /* kernel bug? */ evt->event.id = &evt->id_priv->id; evt->event.status = resp.status; if (ucma_is_ud_qp(evt->id_priv->id.qp_type))