From patchwork Wed Mar 26 20:52:02 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alex Elder X-Patchwork-Id: 3895281 Return-Path: X-Original-To: patchwork-ceph-devel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.19.201]) by patchwork2.web.kernel.org (Postfix) with ESMTP id BC8B8BF540 for ; Wed, 26 Mar 2014 20:51:50 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id EC3BE201BA for ; Wed, 26 Mar 2014 20:51:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 9C5ED2017B for ; Wed, 26 Mar 2014 20:51:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755422AbaCZUvq (ORCPT ); Wed, 26 Mar 2014 16:51:46 -0400 Received: from mail-ie0-f169.google.com ([209.85.223.169]:42821 "EHLO mail-ie0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755274AbaCZUvp (ORCPT ); Wed, 26 Mar 2014 16:51:45 -0400 Received: by mail-ie0-f169.google.com with SMTP id to1so2309744ieb.28 for ; Wed, 26 Mar 2014 13:51:45 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=Nmvd7PgVpXf9Nuy+myR1qeE0W4lBDUjrcfmQw+7AsYk=; b=FLN3D/hZ56h4dZO2YRy3UmnjH8oVpje+KCEwEQlVC3t+NKNczJblqVf0tiY7nxCecH k5eS3v/0SLubbwzIS0vgQJMsAhz5Q9WPb0mqNy8jcrICJCtDBCYEihlicetup5rjjDHc 0J+PxNVVago2Cx0xk1qUY3IoCdXzMen8MqSgJmwXSgbCcHCxsELc+/B9pasA0a5/JppA jz5kNKjhOHtuKQNkHtY51KYf0DLJa7akNecyMysEacwtdiYM7P4W28nAUDmNiAYV8M7L 8skI5zX4/v1C+NoDiMeNCOlU7rup5aa+APPgH/rqjmZO+KYK88b9H4J75sHrhXZN/mq0 2qHw== X-Gm-Message-State: ALoCoQnEs/q9Z+soyXP/bHQUaMHTi7JNUAoYeurRS5mtaKDOSVoKykSDZNzh+erkcjk3E8kW2NtY X-Received: by 10.50.92.98 with SMTP id cl2mr26904091igb.34.1395867105501; Wed, 26 Mar 2014 13:51:45 -0700 (PDT) Received: from localhost.localdomain (c-71-195-31-37.hsd1.mn.comcast.net. [71.195.31.37]) by mx.google.com with ESMTPSA id dz8sm715335igb.5.2014.03.26.13.51.44 for (version=TLSv1.1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 26 Mar 2014 13:51:44 -0700 (PDT) From: Alex Elder To: ceph-devel@vger.kernel.org Cc: ilya.dryomov@inktank.com, ob@daevel.fr Subject: [PATCH] rbd: drop an unsafe assertion Date: Wed, 26 Mar 2014 15:52:02 -0500 Message-Id: <1395867122-9340-1-git-send-email-elder@linaro.org> X-Mailer: git-send-email 1.7.9.5 Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org X-Spam-Status: No, score=-7.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Olivier Bonvalet reported having repeated crashes due to a failed assertion he was hitting in rbd_img_obj_callback(): Assertion failure in rbd_img_obj_callback() at line 2165: rbd_assert(which == img_request->next_completion); With a lot of help from Olivier with reproducing the problem we were able to determine the object and image requests had already been completed (and often freed) at the point the assertion failed. There was a great deal of discussion on the ceph-devel mailing list about this. The problem only arose when there were two (or more) object requests in an image request, and the problem was always seen when the second request was being completed. The problem is due to a race in the window between setting the "done" flag on an object request and checking the image request's next completion value. When the first object request completes, it checks to see if its successor request is marked "done", and if so, that request is also completed. In the process, the image request's next_completion value is updated to reflect that both the first and second requests are completed. By the time the second request is able to check the next_completion value, it has been set to a value *greater* than its own "which" value, which caused an assertion to fail. Fix this problem by skipping over any completion processing unless the completing object request is the next one expected. Test only for inequality (not >=), and eliminate the bad assertion. Tested-by: Olivier Bonvalet Signed-off-by: Alex Elder --- drivers/block/rbd.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index f044fab..4c95b50 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2125,10 +2125,9 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request) rbd_assert(which < img_request->obj_request_count); spin_lock_irq(&img_request->completion_lock); - if (which > img_request->next_completion) + if (which != img_request->next_completion) goto out; - rbd_assert(which == img_request->next_completion); for_each_obj_request_from(img_request, obj_request) { rbd_assert(more); rbd_assert(which < img_request->obj_request_count);