From patchwork Tue Mar 25 13:18:36 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Olivier Bonvalet
X-Patchwork-Id: 3887551
Message-ID: <1395753516.2823.37.camel@localhost>
Subject: Re: Issue #5876 : assertion failure in rbd_img_obj_callback()
From: Olivier Bonvalet
To: Ilya Dryomov
Cc: Alex Elder, Ceph Development
Date: Tue, 25 Mar 2014 14:18:36 +0100
References: <1395736765.2823.29.camel@localhost> <53316D18.7040103@ieee.org> <53317BC2.9010700@ieee.org>
X-Mailer: Evolution 3.8.5-2+b3
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: ceph-devel@vger.kernel.org

On Tuesday, 25 March 2014 at 14:57 +0200, Ilya Dryomov wrote:
> On Tue, Mar 25, 2014 at 2:51 PM, Alex Elder wrote:
> > On 03/25/2014 07:34 AM, Ilya Dryomov wrote:
> >>> On 03/25/2014 04:04 AM, Ilya Dryomov wrote:
> >>>> On Tue, Mar 25, 2014 at 10:39 AM, Olivier Bonvalet wrote:
> >>>>> Hi,
> >>>>>
> >>>>> what can/should I do to help fix that problem?
> >>>>>
> >>>>> for now, the RBD kernel client hangs on:
> >>>>> Assertion failure in rbd_img_obj_callback() at line 2131:
> >>>>> 	rbd_assert(which >= img_request->next_completion);
> >>>
> >>> If you can build your own kernel as Ilya says I'd like to
> >>> see the values of which and img_request->next_completion
> >>> here.
> >>
> >> Looks like which was 1, which means that next_completion had to be 2 or
> >> greater.  I miss Solaris crash dumps ...
> >>
> >> On a different note, why are we asserting next_completion outside of
> >> a spinlock which is supposed to protect next_completion?
> >
> > That's a very good point (which could be easily remedied by moving
> > the assertion down a couple lines).  The image object request (#1)
> > in this case will have been marked done at this point; it's possible
> > that request #2 (or later) was concurrently getting handled by the
> > for_each_obj_request_from() loop below in that same function, but
> > may not have updated next_completion yet.
> >
> > So that *could* explain the tripped assertion.
> > The assertion should be moved in any case, it's a bug.
> >
> > That being said, it doesn't explain the other assertion:
> > 	rbd_assert(img_request != NULL);
> > So there's at least one other thing going on.
>
> Yeah, exactly my thoughts.
>
> Thanks,
>
>                 Ilya

So, could this patch be a (partial) fix?

---
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2123,6 +2123,7 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
 	rbd_assert(obj_request_img_data_test(obj_request));
 	img_request = obj_request->img_request;
 
+	spin_lock_irq(&img_request->completion_lock);
 	dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
 	rbd_assert(img_request != NULL);
 	rbd_assert(img_request->obj_request_count > 0);
@@ -2130,7 +2131,6 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
 	rbd_assert(which < img_request->obj_request_count);
 	rbd_assert(which >= img_request->next_completion);
 
-	spin_lock_irq(&img_request->completion_lock);
 	if (which != img_request->next_completion)
 		goto out;
 
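
For illustration only, here is a small userspace sketch of the point raised in the thread: an invariant that reads next_completion should be checked while holding the lock that protects it. This is not the rbd code. struct img_req, img_req_init() and img_req_complete() are hypothetical names, a pthread mutex stands in for the kernel spinlock, assert() stands in for rbd_assert(), and the done[] bookkeeping is a simplification of how the driver tracks completed object requests.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OBJ_REQUESTS 16

/* Hypothetical userspace stand-in for the rbd image request. */
struct img_req {
	pthread_mutex_t completion_lock;	/* protects next_completion and done[] */
	unsigned int obj_request_count;		/* fixed after init */
	unsigned int next_completion;
	bool done[MAX_OBJ_REQUESTS];
};

static void img_req_init(struct img_req *req, unsigned int count)
{
	assert(count > 0 && count <= MAX_OBJ_REQUESTS);
	pthread_mutex_init(&req->completion_lock, NULL);
	req->obj_request_count = count;
	req->next_completion = 0;
	for (unsigned int i = 0; i < MAX_OBJ_REQUESTS; i++)
		req->done[i] = false;
}

/*
 * Record completion of object request `which` and return the updated
 * next_completion. Every read of next_completion, including the one in
 * the assertion, happens with completion_lock held.
 */
static unsigned int img_req_complete(struct img_req *req, unsigned int which)
{
	unsigned int next;

	/* The lock lives inside *req, so check the pointer before locking. */
	assert(req != NULL);

	pthread_mutex_lock(&req->completion_lock);
	assert(which < req->obj_request_count);
	assert(which >= req->next_completion);

	req->done[which] = true;
	if (which == req->next_completion) {
		/* Advance past every request that has already completed. */
		while (req->next_completion < req->obj_request_count &&
		       req->done[req->next_completion])
			req->next_completion++;
	}
	next = req->next_completion;
	pthread_mutex_unlock(&req->completion_lock);

	return next;
}
```

With three object requests, completing #1 first leaves next_completion at 0; completing #0 afterwards advances it to 2 in one step, because #1 was already marked done. Since the check and the update sit in the same critical section, a concurrent caller can never observe a value of next_completion that is about to change underneath its assertion.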